Foreword



It would appear we have reached the limits 
of what is possible to achieve with com- 
puter technology, although one should be 
careful with such statements - they tend to 
sound pretty silly in five years. 

John von Neumann, 1949. 



Application-specific instruction set processors (ASIPs) have the poten- 
tial to become a key building block of future integrated circuits for dig- 
ital signal processing. ASIPs combine the flexibility and competitive 
time-to-market of embedded processors with the computational perfor- 
mance and energy-efficiency of dedicated VLSI hardware implementa- 
tions. Furthermore, ASIPs can easily be integrated into existing semi- 
custom design flows: the ASIP designer has full control of the imple- 
mentation and verification. As ASIPs replace commercial embedded 
processors, there is no need to pay royalties to third parties. 

This book was written for hard- and software design engineers as well 
as students with a fundamental knowledge of VLSI logic design. The 
benefits of ASIPs can only be exploited by designers with expertise 
in the fields of VLSI hardware, computer architecture, and embed- 
ded software design. This book provides the essential knowledge in 
each of these disciplines and focuses on the practical implementation 
of ASIPs for real-world applications. Many examples illustrate the pro- 
posed methodology; theoretic discussions are kept to the minimum. 

This book constitutes my Ph.D. thesis, which was carried out at the
Institute for Integrated Signal Processing Systems at Aachen University
of Technology (ISS/RWTH Aachen/Germany). My reviewers encouraged
me to extend my thesis and publish this comprehensive book about
ASIP design.

The first chapter of this book introduces the advantages of ASIPs and 
motivates the requirement for an elaborated design methodology. In 
Chapter 2, the focus of this work is described in detail and an overview 
of related work is given. Chapter 3 introduces and summarizes the 
basics of low-energy VLSI design. This chapter is a prerequisite for 
the design space definition of ASIPs and the discussion of critical fac- 
tors for energy-efficient ASIP architectures in Chapter 4. The proposed 
ASIP design flow is presented in Chapter 5 with a special focus on design
tasks to obtain an energy-efficient implementation. The LISA tool
suite, which was developed at the ISS, and enhancements of these tools
triggered by this work are presented in Chapter 6. The described tools 
support the generation of critical hardware parts in order to save en- 
ergy as well as the verification of the implemented ASIP hard- and soft- 
ware. Quantitative results of two case studies are given in Chapter 7, 
which prove the applicability of the proposed design flow and the devel- 
oped tools. The first case study demonstrates the impressive potential of 
ASIP performance and energy optimizations, whereas the second case 
study compares the architectural and implementation efficiency of two 
different ASIP design approaches. 
Chapter 1



Introduction 



In the last decade, integrated digital circuits have emerged as the com- 
putational core of many digital devices in everyday life. Examples are 
mobile phones, organizers, personal computers, networking devices and 
embedded systems for automotive and industrial automation. The economic
importance of digital devices is steadily increasing, with an average
annual growth rate in semiconductor sales of about 15% since the
development of the microprocessor [38].

As Moore’s law is expected to be valid for at least the next decade [212], 
the capability and complexity of digital devices will continue to grow. 
However, the growth in design productivity for digital circuits cannot 
keep up with the technological growth [136]. This gap represents a 
serious bottleneck for the implementation of new competitive devices. 
Especially for embedded digital circuits, a shift from hardware to soft- 
ware implementations is a solution to this issue. Increasing the software 
part of a design improves the design productivity due to the simplicity 
of the software implementation process and due to the increased design 
reuse factor. Furthermore, this shift to software in embedded systems
(embedded software) enables more designers to participate in the im- 
plementation process. 

Moore’s law results in an increasing percentage of software-realizable
implementations for all applications with a constant computational
complexity^ due to the exponential increase in available processing
power. Embedded software typically has to meet tight real-time constraints
and/or high energy-efficiency requirements, especially for mobile
appliances. For these two reasons, specialized embedded processors
like commercial DSPs or microcontrollers are becoming more and
more popular. These embedded processors provide a significantly better
cost/performance ratio than general purpose processors for desktop
applications. Nevertheless, embedded processors are still limited
in terms of computational performance, because they use a fixed,
non-application-specific instruction set architecture. Furthermore, these
processors typically exhibit poor energy-efficiency compared to more
application-specific implementations, because they target a broad range
of embedded applications.

^This is also true for applications with exponentially increasing computational complexity, provided that
the exponential increase in computational complexity is smaller than the technological increase.

Energy-efficiency and flexibility are competing goals for a hardware 
implementation. Figure 1.1 depicts this tradeoff for several implemen- 
tation paradigms: embedded standard processors, DSPs, FPGAs and 
dedicated hardware. The so-called application-specific instruction set 
processors (ASIPs) are able to fill the energy-flexibility gap between 
dedicated hardware and programmable DSPs for a given application ac- 
cording to Figure 1.1. ASIPs take advantage of user-defined instructions 
and a user-defined data path optimized for a certain target application. 
The result of this optimization is a higher computational performance 
than general purpose approaches and a better energy-efficiency. This is 
one reason for the current industrial trend to use more and more cus- 
tomized processors [23]. This trend can be explained from the perspec- 
tive of both hardware and software designers. 

From the hardware designers’ point of view, ASIPs considerably facili- 
tate the implementation of tasks that require a high degree of flexibility. 
This flexibility is needed to track evolving standards and for implemen- 
tations that are prone to late design changes. Furthermore, the design 
time is decreased especially due to the high reuse factor of software- 
based implementations. This fact is particularly important for redesigns 
with the goal to implement distinguishing features in an existing product 
for competitive reasons. Finally, the ASIP tasks can be modeled with 
high level languages, which provide a rapid and methodical approach 
to the design of resource shared hardware. Synthesizable ASIPs are 
technology-independent and can easily be integrated in any established 
semi-custom design flow together with other hardware blocks. 

From the software point of view, ASIPs offer a new degree of freedom
for optimization: the design input for ASIPs is both the software
implementation in the form of a high level language description and
the ASIP hardware architecture in the form of a hardware description
language. The new degree of freedom for software designers, the hardware
architecture, removes the traditional upper bound in computational
performance of conventional fixed processor architectures by introducing
scalability of processor resources. Therefore, oversized and energy-wasting
fixed processor cores can be replaced by energy-efficient ASIPs
to meet the performance constraints of an embedded application.

Figure 1.1: The Energy-Flexibility Gap (Source: [1] with modifications)

ASIP design is a complex optimization problem requiring expertise in
VLSI logic, computer architecture and application software design. The
complexity of this design task makes it difficult for the designer to explore
a large number of design alternatives in order to find an optimum
implementation within a competitive design time. Furthermore, ASIP
design for systems with tight energy constraints leads to additional
complexity, which aggravates this issue.

This thesis presents a solution to this complexity problem by providing
an optimized design methodology for ASIPs considering the typical
performance and energy constraints of mobile embedded systems. The
feasibility of the proposed design methodology is proven with two case
studies. Furthermore, typical ASIP optimizations are introduced and
evaluated in order to assess the potential of best practice ASIP
implementations over fixed processor architectures.



Chapter 2 



Focus and Related Work 



This chapter presents the motivation and the focus of this thesis as 
well as the essential differences to existing approaches. Moreover, an 
overview of related work concerning ASIP design for low-energy con- 
sumption is given. 



2.1 Focus of This Work 

The focus of this work is application-specific instruction set processors
(ASIPs) for embedded DSP applications with performance and energy 
constraints. Energy in this context refers to the energy that is consumed 
for a given well-defined computational task. This metric corresponds to 
the average power that is consumed for the same task. 
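
Assuming a task that runs for a fixed execution time, the relation between the two metrics can be written out explicitly. The symbols $E_{\text{task}}$, $\bar{P}$, $P(t)$ and $T_{\text{task}}$ are introduced here for illustration only and are not taken from the original text:

```latex
E_{\text{task}} \;=\; \int_{0}^{T_{\text{task}}} P(t)\,\mathrm{d}t
\;=\; \bar{P}\cdot T_{\text{task}},
\qquad
\bar{P} \;=\; \frac{1}{T_{\text{task}}}\int_{0}^{T_{\text{task}}} P(t)\,\mathrm{d}t
```

For a fixed execution time, minimizing the energy per task and minimizing the average power are therefore equivalent. If an optimization also changes the execution time, the two metrics can diverge.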

The proposed methodology primarily targets (but is not limited to) 
semi-custom designs. This design approach enables the use of a high 
level of abstraction for design entry, whereas the degrees of freedom for 
optimization are moderately decreased compared to full-custom design 
due to the constraints imposed by the standard cell library supplier. 

This work aims to answer the following scientific questions:

• How much can be gained in performance and energy-efficiency 
using ASIPs instead of general purpose processors? 

• To what extent can energy-driven ASIP optimizations increase
the energy-efficiency? 

• How can ASIPs be designed in order to meet the performance, 
energy and/or area constraints? 

• Does the proposed design methodology enable a competitive time- 
to-market? 
In order to answer these questions, several case studies have been performed
and extensive optimization techniques have been developed and
evaluated. These optimization techniques include general low-power 
optimizations for dedicated hardware as well as ASIP-specific tech- 
niques. 

Furthermore, tool-based methodologies for the instruction encoding and 
for the generation of energy-critical ASIP parts as well as enhanced ver- 
ification techniques have been developed. This is especially important 
to obtain a competitive design time for the ASIP implementation. 



2.2 Previous Work 

At present, there is no publication covering a design methodology for
ASIPs that enables the joint optimization of performance, silicon area
and power consumption. This fact is also emphasized by Jain [132].
The published work focuses instead on performance optimization; some
publications also cover the tradeoff between performance and silicon
area.

Publications related to low-power ASIP design can be subdivided into 
the topics ASIP design methodologies, ASIP case studies, and basic 
low-power design techniques for general purpose processors and ded- 
icated hardware. Furthermore, ASIP verification (which is a subtopic 
of ASIP design methodologies) is of paramount importance to obtain 
working silicon and represents a tedious and time-consuming design 
task. The following subsections provide an overview of literature cov- 
ering these four topics. 



2.2.1 ASIP Design Methodologies 

This summary of ASIP design methodologies does not discuss in detail 
general purpose processor designs and significantly incomplete ASIP 
design environments without a path to hardware implementation like 
BUILDABONG [242], PARTITA [50], ISPS [22], [34], the work of Engel
[69] and of Bajot [20]. Furthermore, methodologies with a significantly
incomplete software design tool chain like READ [145] as well
as various compiler-centric publications on ASIP code generation [102]
[103] [52] [99] [130] [165] [166] [170] [208] [230] [247] [254] are not 
explicitly covered. 

Existing ASIP design environments can be differentiated according to 
the flexibility to support various processor classes. Many design en- 
vironments use predefined, largely invariant processor templates and 
software design tools, covering a limited ASIP design space. Other en- 
vironments provide generic processor description languages, which en- 
able the designer to add user-defined structures to an existing processor 
or to describe entirely new processor architectures, often at the expense 
of the quality of the available software design tools. 

Commercial approaches targeting largely fixed processor templates in- 
clude the Xtensa core of Tensilica [243] [97], the ManArray architec- 
ture of BOPS [31], the ARCtangent processor of ARC [12] [13], the 
Jazz processor of Improv [261] [160] and the R.E.A.L. DSP of Philips 
[141]. Further work on largely fixed processor classes includes Flexware
[202], the research project PICO [211] [2], Satsuki [225], and ASIA
[116] [117].

Xtensa is a moderately parameterizable RISC (reduced instruction set 
computer) load/store architecture with variable length instructions (24 
or 16 bit), 3 operand instructions, and about 80 base instructions. Parameters
of the processor comprise the choice of a 32- or 64-entry general
purpose register file, the size of caches, the write buffer size, the
endianness and the availability of certain instructions like multiply-accumulate.
The automatically generated design tools for each specific processor
instance of Xtensa include a C compiler, assembler, linker and debugger.
Furthermore, the user can define new instructions and additional functional
units using the TIE^ language.

^Tensilica Instruction Extension

The ManArray architecture of BOPS uses the concept of a multi-processor
system, which is optimized for DSP applications like wireless
applications, multimedia and image processing. Each of the processor
elements is a RISC core with a fixed so-called indirect VLIW^ (iVLIW)
architecture, which is implemented by a VLIW look-up table and special
32 bit instructions to execute one of the stored VLIW instructions^.
The instruction set supports typical DSP instructions, subword-level
parallelism, and also typical micro-controller features like bit manipulation
and low-latency interrupts. The focus of the ManArray architecture
is the scalability of a parameterizable number of tightly interconnected
processing elements for regular algorithms requiring high computational
performance.

^VLIW is the abbreviation for very long instruction word.

The ARCtangent-A4 microprocessor of ARC is a unified RISC/DSP 
core with a 4 stage pipeline architecture and configurable functional 
units, memories and an extensible instruction set. The core is delivered 
as a soft-core together with software development tools including a DSP 
function library. 

The Jazz processor of Improv is a customizable building block em- 
bedded into a generic platform (PSA - programmable system architec- 
ture) comprising several typically different instances of the processor 
together with data/instruction memory and I/O blocks. The Jazz pro- 
cessor represents a memory-register VLIW-architecture using a set of 
predefined computational units in combination with high bandwidth to 
data memory. For the specific instances the user can configure param- 
eters like the data width, the depth of the hardware task queue and the 
number and kind of computational units within certain constraints. For 
the selected architecture configuration, software design tools as well as 
synthesizable HDL code can be automatically generated. 

The R.E.A.L. DSP of Philips uses a customizable base architecture with
2 multipliers and 4 ALUs in combination with a general purpose register
file. Instruction formats with 16 and 32 bits are supported, as well as
so-called ASIs (application-specific instructions), which allow up to 256
VLIW instructions to be stored in an internally customizable look-up
table^. The DSP programmer or the high level language (HLL) compiler
has to specify the part of the code for parallel execution. The ASI
look-up table can be implemented using a RAM for prototypes or a
ROM/synthesized block for the processor in the final product.
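
The look-up-table idea behind the iVLIW and ASI concepts described above can be illustrated with a small model. This is only a sketch under stated assumptions: the class `VliwTable`, the function `expand_program` and the opcode name `xv` are hypothetical and do not correspond to the actual BOPS or Philips instruction sets; only the table size of 256 entries is taken from the text.

```python
# Minimal sketch of an indirect-VLIW look-up table (all names hypothetical).

class VliwTable:
    """Stores wide instruction words that short instructions refer to by index."""

    def __init__(self, size=256):
        self.size = size          # 256 entries, as quoted for the R.E.A.L. DSP
        self.entries = {}

    def load(self, index, slots):
        """Store one VLIW word, given as a sequence of parallel operations."""
        if not 0 <= index < self.size:
            raise IndexError("table index out of range")
        self.entries[index] = tuple(slots)

    def expand(self, instruction):
        """Return the operations issued in one cycle for a short instruction."""
        opcode, argument = instruction
        if opcode == "xv":                    # assumed "execute VLIW" opcode
            return self.entries[argument]     # indirect: fetch the stored word
        return (instruction,)                 # ordinary instruction issues alone


def expand_program(table, program):
    """Expand a list of short instructions into per-cycle issue packets."""
    return [table.expand(instruction) for instruction in program]


table = VliwTable()
table.load(3, [("mul", "r1"), ("add", "r2")])
packets = expand_program(table, [("add", "r0"), ("xv", 3)])
print(packets)
```

The program memory only holds the short instruction `("xv", 3)`, while the wide word with two parallel operations lives in the table. This is the reason why such a table can be a RAM during prototyping and a ROM in the final product.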

^This concept generalizes the idea of the CLIW (configurable long instruction word) architecture of
CARMEL (Infineon Technologies [232]).

^This approach is similar to the above-mentioned iVLIW concept.

The Flexware environment is an ASIP design environment based on
a simple parameterizable processor template. Configuration parameters
include the bit width, the number of registers and the number of
ALUs as well as the definition of new instructions. The environment
provides the typical software design tool chain including the code
generator CodeSyn [201] and a hardware description generation back end.
Simulation is performed using the VHDL model of the target processor.

PICO as well as Trimaran [255] are both projects of the compiler and
architecture research group of HP Labs. PICO is an environment that
automatically explores the design space for a heterogeneous
processor-coprocessor system for applications written in C code. A synthesizable
VHDL description for non-programmable processors as well as an optimally
configured instance of a VLIW processor called HPL-PD [137]
are generated. The approach is limited to this processor type with a fixed
instruction set, but it supports different memory and cache configurations.

Satsuki is a design environment, which uses a moderately parameteriz- 
able processor template as target architecture. Parameters of this tem- 
plate are data path width, number of registers and instruction and data 
memory size. Furthermore, a C compiler for single precision integer 
arithmetic is supported. 

ASIA is a system that synthesizes instruction set architectures for a
given application, which is available in the form of a micro-operation
program and a coarse pipeline stage structure of the target architecture.
The result of ASIA is the microarchitecture definition of an architecture
that is able to satisfy a given runtime constraint and that uses
a data stationary control model.

Commercial ASIP design environments supporting more flexible target
architectures are currently being designed by TargetCompiler Technologies
[240] and by STMicroelectronics (Flexware2 [199]).

The environment of TargetCompiler Technologies is based on the high 
level modeling language nML [84]. The supported retargetable tools 
include a C compiler, an instruction set simulator, an assembler and 
linker as well as a hardware description generator. The description
language nML has been extended to support pipelined architectures.
Unfortunately, no list of restrictions is available that describes the
limitations of the supported processor architecture classes.

Flexware2 [199] is the successor of Flexware and is based on the
instruction set description language IDL [200]. The Flexware2 environment
enables the generation of an instruction set simulator, an assembler, and
a linker. The HLL compiler is based on the COSY framework [3] and
needs a separate processor description. Hardware generation from IDL
is currently not supported.

Scientific ASIP environments targeting flexible processor architectures 
comprise PEAS-III [146] [126] and MetaCore [280]. 

The ASIP design environment PEAS-III uses a textual micro-operation 
description and provides a GUI for ASIP modeling. PEAS-III enables 
the generation of a synthesizable hardware model [126] and develop- 
ment tools like a C Compiler [74]. Unfortunately, there is little infor- 
mation available about the supported processor classes and about the 
quality of results. 

The environment MetaCore uses a predefined parameterizable DSP mi- 
croarchitecture supporting an essential set of basic instructions as well 
as user-selectable predefined instructions. Furthermore, the designer
can add application-specific instructions. The specification of the tar- 
get processor is achieved using a structural (MSE) and a behavioral 
(MBE) description language. The generated development tools com- 
prise the entire set of typical software design tools including a GCC- 
based C compiler, instruction set simulator and profiling tools. 



2.2.2 ASIP Case Studies 



Over the past few years many ASIPs have been designed in industry and 
in academia. Table 2.1 provides an overview of relevant academic and
industrial ASIP designs and case studies. It has to be emphasized that 
most of the published ASIP case studies focus on performance rather 
than power optimization. An exception is the work of Kuulusa [152], 
which evaluates the effect of instruction set modifications on the power 
consumption, but without using more extensive architectural optimiza- 
tions. 
Authors, Affiliation and Reference                      Design Description/Focus
------------------------------------------------------  ---------------------------------------------------------------------------
Kuulusa, Tampere Univ. [152]                            configurable DSP core for GSM
B. Kienhuis et al., Philips Research [140] [162]        stream-based data flow arch.
A. Alomary, Univ. of Tokyo [8]                          GCC-based [229] primitive operations
J. Van Praet, IMEC, Leuven [262]                        autom. analysis of instruction bundles, then incremental arch. optimization
A. Fauth, Univ. of Berlin [79] [78]                     user specified architecture using nML [84] with HW generation
Ing-Jer Huang, USC [115]                                focus on instruction definition for a given architecture
F. Onion, UC Irvine [195]                               compiler assisted insertion of instructions using chained ops.
Q. Zhao, Eindhoven Univ. [282]                          static resource model for high-level ISA and compiler design
J. Gong, UC Irvine [96] [95]                            parameterizable VLIW architecture
M. Arnold, Delft Univ. [17]                             limited interconnections are investigated
K. Küçükçakar, Escalade Corp. [150]                     only incr. modifications of largely fixed arch.
H. Choi, Korea Inst. of Sc. and Tech. [49]              case study with PARTITA design environment
J.-H. Yang, Korea Inst. of Sc. and Tech. [70]           case study with MetaCore environment
P. Faraboschi, M. Gschwind, IBM Research Center [101]   multi-cluster VLIW, case study for Prolog and vector prefetching
M. Itoh, Y. Takeuchi, Osaka Univ. [127]                 HDL generation from a µ-op. descr. (PEAS-III)
R. Camposano, J. Wildberg, GMD [41]                     case study with CASTLE environment

Table 2.1: ASIP Case Studies



2.2.3 Basic Low-Power Design Techniques 

In the following, several low-power design approaches are discussed.
These approaches are applicable to general purpose and application-specific
processors, but also to dedicated hardware designs^. Due to
the fact that a detailed discussion of low-power hardware design
techniques is presented in Chapter 3, only the most important approaches
are summarized at this point.

There is a variety of publications concerning low-power hardware design
in general, e.g. [177] [44] [45] [66] [104] [221] [251]. Many of
them cover technological issues and also typical full-custom techniques 
like voltage scaling in combination with parallelization, which are not 
directly applicable to semi-custom chip design. Other publications also 
cover algorithmic optimizations like optimized filter coefficients, arith- 
metic operator minimization, and optimization of number representa- 
tions. High-level optimizations include scheduling techniques in order 
to exploit data correlation. Standard circuit optimizations like guarding
techniques (which include clock gating) and precomputation are
also treated. A more detailed review of work on this topic is given in 
Section 3.3. 

Many other publications [273] [25] [83] exploit the statistical properties 
of the encoded values by using redundant additional information or by 
optimized non-redundant code assignments. The goal of these encod- 
ing techniques is to lower the toggle frequency of heavily loaded nodes 
in order to save power. In [264], an approach is presented to reduce
the power in a cache memory based on physical modifications of the
memory architecture, which avoid the discharge activity of the
high-capacitance bit lines. These approaches are related to the idea that has
been used in Section 6.3.1 to reduce the energy consumption in embed- 
ded instruction memories. 

The following publications focus on architectural and/or instruction set 
modifications to decrease the power consumption. 

In [131] the number of general purpose registers of an ARM7TDMI 
[5] is varied in order to evaluate its effect on power consumption and 
runtime. Unfortunately, the power model of this work neglects the re- 
duced energy consumption of the changed register file and only takes 
into account the energy of the memory accesses. 



^For the greater part, only those approaches have been selected which are applicable to semi-custom 
ASIP design. 
In [231] an architecture tuning methodology is described that uses a
fixed instruction set and tunes the implementation e.g. by adding spe- 
cialized registers for frequently addressed memory locations etc. The 
paper provides an evaluation of sample modifications with a limited 
scope resulting in small power savings. 

In [272] an energy-conscious methodology for the exploration of a 
processor-accelerator system using an ARM-compatible core and a cus- 
tom accelerator is described. The processor instruction set in this case, 
however, is not application-specific. Similar work has been performed 
in [100]. 

Several ideas concerning instruction sets with the option to generate 
software for low-power consumption are presented in [18]. One idea 
is the concept of programmable bypass and forwarding registers, where 
the compiler decides whether a bypass or a forwarding register can be 
used instead of a real register as data source with the goal to avoid gen- 
eral purpose register accesses. This concept can be regarded as expos- 
ing the microarchitecture to the SW interface, which is not unusual for 
many VLIW architectures. A similar approach has been adopted for the 
scalable processor architecture in Section 7.2 of this thesis. 
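
The potential benefit of compiler-visible forwarding registers can be estimated with a very small model. The sketch below is an assumption-laden illustration, not the method of [18] or of Section 7.2: it treats a program as a straight-line list of `(destination, sources)` tuples and assumes that the results of the last `forwarding_depth` instructions can be read from forwarding registers instead of the general purpose register file. The function name and the program format are hypothetical.

```python
# Count how many operand reads could be served by forwarding registers
# instead of the general purpose register file (straight-line code only).

def count_operand_reads(program, forwarding_depth=1):
    """Return (register_file_reads, forwarded_reads) for the program."""
    register_file_reads = 0
    forwarded_reads = 0
    recent = []  # destinations of the most recently executed instructions
    for destination, sources in program:
        for source in sources:
            if source in recent:
                forwarded_reads += 1        # value is still in a forwarding register
            else:
                register_file_reads += 1    # value must come from the register file
        recent.append(destination)
        # Keep only the newest results; a depth of 0 disables forwarding.
        recent = recent[-forwarding_depth:] if forwarding_depth > 0 else []
    return register_file_reads, forwarded_reads


program = [
    ("r1", ("r2", "r3")),
    ("r4", ("r1", "r5")),
    ("r6", ("r4", "r1")),
]
print(count_operand_reads(program, forwarding_depth=1))
print(count_operand_reads(program, forwarding_depth=2))
```

A real compiler would additionally have to respect pipeline timing and control flow. The sketch only shows why register file accesses drop when recently produced values are addressed directly.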

The following publications cover general purpose processor design 
techniques. Many of the presented ideas are also applicable to semi- 
custom ASIP design. 

In [36] a summary of energy and power metrics for general purpose 
processor systems is presented together with basic design optimizations 
in order to increase efficiency. These optimization techniques include 
voltage scaling^ and optimum instruction set design including the num- 
ber of registers and the number and kind of functional units and sup- 
ported addressing modes. Furthermore, energy-efficient cache design 
and energy-aware operating systems are discussed. 

^Voltage scaling is typically not applicable to semi-custom designs due to the lack of characterized
standard cells.

The publications of Tiwari [249] [250] cover a wide range of optimizations
for high performance processors including technological
optimizations (low-power libraries), circuit techniques (transistor sizing,
logic optimization on register transfer level) and operating system
power management techniques. Tiwari also introduces a power model
[253] for software optimizations, which results in low-power code generation
strategies [252], e.g. reduction of memory accesses, energy-driven
instruction selection, instruction reordering, instruction packing,
operand swapping and SW power management.

A microcontroller explicitly optimized for high energy-efficiency is the 
M•Core of Motorola [187]. Various publications [223] [158] [224] describe
the low-power techniques that have been applied for this processor
core including selective power-down mechanisms, high code
density, rich register set, multiple data sizes support, loop cache, and 
branch folding. The publications partially include power evaluations 
of these optimizations. Application-specific adaptations have not been 
performed for this core since it mainly targets general purpose micro- 
controller applications. 



2.2.4 Verification 

The verification task targeted by the tool in Section 6.3.2 of this thesis 
checks the correctness of the ASIP HW description with respect to a 
cycle-true instruction set reference model. Even in the case of an automatically
generated hardware description, this verification is important
to reduce the design risk due to errors in the hardware generation tool. 

Theoretically, this verification task can be realized with a formal veri- 
fication approach like in [11], [135], [167], and [277], but this requires 
a formal specification of the processor description. In the case of the 
LISA environment an appropriate formal description is not available. 
Therefore, functional simulation has to be used, which is also applied 
by many industrial processor design teams [260] [68] [174]. 

The task of providing suitable stimuli for this functional simulation is 
tedious and time consuming, but can be partially automated [43] [149]. 
Kruger [149] presents a tool which is targeted for self test program gen- 
eration using a structural description of the processor as input to the test 
program generator. Chandra [43] discusses a methodology to generate 
stimuli for the IBM S/390 processor. Chandra’s methodology includes 
techniques like symbolic execution and constraint solving in order to 
cover boundary conditions. The described methodology is targeted and 
optimized for a single general purpose processor. 
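
The basic principle of verification by functional simulation, as discussed in this section, is to apply the same stimuli to the reference model and to the implementation and to compare the results step by step. The following sketch illustrates this with a toy 8-bit accumulator machine. Everything in it is hypothetical: the instruction set, the deliberately faulty `buggy_step` model standing in for a hardware description, and the purely random stimuli, which are a much simpler substitute for the structured generation methods of [43] and [149].

```python
import random


def reference_step(state, instruction):
    """Hypothetical reference model: one instruction on an 8-bit accumulator."""
    opcode, operand = instruction
    if opcode == "add":
        return (state + operand) & 0xFF
    if opcode == "sub":
        return (state - operand) & 0xFF
    if opcode == "clr":
        return 0
    raise ValueError("unknown opcode: " + opcode)


def buggy_step(state, instruction):
    """Stand-in for a faulty implementation: 'sub' saturates instead of wrapping."""
    opcode, operand = instruction
    if opcode == "sub":
        return max(state - operand, 0)
    return reference_step(state, instruction)


def first_mismatch(reference, implementation, program, initial_state=0):
    """Run both models in lockstep; return details of the first divergence."""
    ref_state = dut_state = initial_state
    for step, instruction in enumerate(program):
        ref_state = reference(ref_state, instruction)
        dut_state = implementation(dut_state, instruction)
        if ref_state != dut_state:
            return step, instruction, ref_state, dut_state
    return None  # no difference observed for this stimulus


def random_program(length, seed):
    """Generate reproducible random stimuli (a simplification)."""
    rng = random.Random(seed)
    return [(rng.choice(["add", "sub", "clr"]), rng.randrange(256))
            for _ in range(length)]


print(first_mismatch(reference_step, buggy_step, [("add", 1), ("sub", 2)]))
print(first_mismatch(reference_step, reference_step, random_program(100, seed=1)))
```

A passing run only shows that the chosen stimuli did not expose a difference. This is why the quality of the generated test programs, including coverage of boundary conditions, matters so much for this approach.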
2.3 Differences to Previous Work

As energy consumption is getting more and more important for digi- 
tal chip designs, low-energy ASIP design methodologies are of special 
scientific interest. All of the previously published ASIP design environ- 
ments primarily focus on performance optimization. Some of them are 
also able to evaluate the area consumption and enable performance-area 
tradeoffs. None of them allows explicit energy optimizations, which can 
be performed with the design methodology as proposed in this thesis. 

A similar statement can be made about the related ASIP case studies: 
none of them systematically evaluates the primary sources of energy 
consumption and none of them proposes or performs explicit energy 
optimizations. 

This thesis differs from the basic low-power design techniques
mentioned above in that it focuses on the special case
of ASIP-typical energy optimizations by exploiting the large ASIP design
space including the user-defined instruction set. Previously published
energy optimization techniques are used, but novel ASIP-specific
energy optimizations are developed as well. In contrast to the related
work, thorough evaluations of these optimizations are performed using
precise gate-level estimations. These techniques have provided essential
concepts for the enhancement of the LISA^ processor design tools
in order to facilitate future low-energy ASIP development.

Verification is an important subtask of complete ASIP design flows 
in order to guarantee a fully functional implementation. This topic is 
treated in this thesis by explicitly covering a tool that has been
developed for this purpose. The proposed semi-automatic test case
generator, which is described in Section 6.3.2, uses an approach similar
to that of Kruger [149]. However, instead of a structural processor
description, the behavioral LISA description is used. Furthermore,
speculative execution on an instruction set simulator together with
user-defined rules guarantees that meaningful test scenarios can be
generated in a short amount of time.



¹ Please refer to Chapter 6 for a description of the LISA tool suite.




Chapter 3 



Efficient Low-Power Hardware 
Design 



From a hardware perspective, an ASIP represents a complex finite state 
machine where the state transitions are triggered by the input data and 
the ASIP software. Consequently, all the techniques for efficient low- 
power or low-energy hardware design are also applicable to ASIPs. This 
fact is emphasized in the case of application-specific hardware acceler- 
ators that are tightly coupled to the ASIP core to increase the energy- 
efficiency of the implementation. This approach implicitly includes the 
ASIP software as being an integral part of the hardware implementation, 
which has to be optimized with equal effort. However, plain software 
optimization techniques are beyond the scope of this chapter. The basics 
of software optimization are briefly treated in Section 5.3.3.

This chapter defines the critical issues of hardware design that have a 
major influence on the final result. The basics of low-power CMOS 
hardware design are described including physical effects, metrics to 
evaluate different architectures and power estimation techniques. Fi- 
nally, specific design techniques to increase the energy-efficiency of 
synthesized semi-custom hardware are presented together with refer- 
ences to practical applications. This chapter represents a prerequisite 
for Chapter 5, where the ASIP-typical hardware and software design
issues are treated. 



3.1 Metrics of the Implementation and the Hardware 
Design Methodology 

In the following two subsections the terms architectural efficiency and
design efficiency are defined similarly to [263] as principal metrics for the
quantitative evaluation of different design alternatives. This discussion
refers to hardware design in general as well as ASIP development.



3.1.1 Characteristics of the Implementation 

Implementation constraints are qualitative or quantitative boundary 
conditions that have to be fulfilled in order to obtain a feasible imple- 
mentation for a given signal processing application. 

Two classes of constraints can be identified: 

• Precise constraints have to be fulfilled accurately, which means 
that even a small deviation between constraint and considered pa- 
rameter leads to device failure. This type of constraint is obviously 
only applicable to qualitative or discrete parameters. 

• Minimum (maximum) constraints are typically met by larger 
(smaller) or equal values of the considered implementation pa- 
rameter. Safety margins of the constraints have to be provided 
as a guard against estimation errors. The quantitative difference 
between constraint and considered parameter is referred to as 
“slack”. Two subtypes of min./max. constraints can be identified 
in the case of a feasible implementation with slack > 0:

- magnitude of slack is unimportant 

- magnitude of slack enhances the implementation and has to 
be optimized 

In the following, a list of requirements for the ASIP hard- and software 
implementation is given. This discussion uses a black box abstraction 
that conceptually contains the complete ASIP hard- and software imple- 
mentation. The following qualitative or quantitative parameters apply 
to this model: 

• correct functionality of the implementation with respect to the 
specified bit-true behavior (precisely constrained) 



• correct timing of interfaces (typ. with min./max. constraints)

• performance constraints like computational throughput, bit error
rate, acquisition probability, number of processed data packets
per time unit (with min./max. constraints and a slack that needs to
be optimized in order to enhance the performance of the digital
system, e.g. to obtain a competitive advantage)

• average and peak power consumption of the module (both with 
max. constraints, the average power consumption needs to be min- 
imized e.g. for mobile appliances or to reduce the costs of pack- 
ages and (active) heatsinks) 

• silicon area of the module (with max. constraint and a slack that 
needs to be optimized to reduce fabrication costs) 

• observability and controllability during operation might be 
needed to discover functional errors in the implementation or the 
specification (sometimes with a min. constraint) 

• the routability of the physical design is determined by the inter- 
connection structures of the design and affects the area utilization 
(silicon area overhead) as well as the timing closure (not explicitly 
constrained by most synthesis tools [194] [215] [60]) 

• testability and self-testability are important features for consumer
products to identify chips with fabrication faults at a very
early stage (before bonding) to reduce overall costs (typ. with
a min. constraint and a slack that has to be optimized for higher
fault coverage)

• flexibility of an implementation is needed to adapt an implementa- 
tion to different applications or evolving standards or to fix design 
errors of the implementation (sometimes with a min. constraint 
and a slack that needs optimization) 

• reusability or IP-reusability refers to the degree of genericity and
flexibility of a design, which enables the reuse of the same design 
with minor modifications for similar applications (typ. with an 
implicit min. constraint, the reusability needs to be increased in 
order to decrease the design effort for similar applications) 

The following characteristics apply to the internal ASIP structure and 
complement this list of characteristics: 
• the application class (e.g. digital filtering, speech coding, image 
transformations etc.) which can be efficiently mapped to the pro- 
cessor data-path is strongly correlated with the above-mentioned 
flexibility of the implementation (extensions of this application 
class increase the flexibility of the ASIP) 

• either simplicity of the instruction set architecture is required to 
enable hand programmability, if no compiler for the architecture is 
available (this parameter strongly affects the software design time) 

• or the instruction set class (ref. to Chapter 4 for a classification) 
should be selected in order to use an available compiler design 
environment e.g. COSY [3], if the ASIP is intended for high level 
language programming support (a good fit of the instruction set 
class results in a lower effort for compiler retargeting) 

For many of the above-mentioned parameters of the implementation the 
associated slack between constraint and parameter can be quantitatively 
evaluated. For a selected set of N important slack values that are subject 
to explicit optimization, it is useful to define a quantitative efficiency 
in order to compare architectural alternatives. These slack values S_n
can be associated with application-specific weights w_n such that the
application-specific architectural efficiency η_arch for these considered
slack values can be defined as follows:



    \eta_{arch} = \prod_{n=1}^{N} \left( \frac{1}{S_n} \right)^{w_n}    (3.1)



The well-known classical efficiency for VLSI circuits η = 1/(AT) is a
special case of the above-mentioned architectural efficiency, which
considers the equally weighted parameters silicon area A and computational
performance (critical path T) of an implementation.
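
The weighted efficiency can be illustrated with a few lines of Python. This is only a sketch: it assumes that Equation 3.1 is the weighted product of the reciprocal parameter values, which is the form that reduces to 1/(AT) for equal weights. The function name and all numeric values are hypothetical.

```python
from math import prod


def architectural_efficiency(values, weights):
    """Weighted product of reciprocals: prod((1 / S_n) ** w_n) for n = 1..N."""
    if len(values) != len(weights):
        raise ValueError("one weight per parameter value is required")
    return prod((1.0 / s) ** w for s, w in zip(values, weights))


# Classical special case 1/(A*T): area and critical path, both weighted with 1.
area_mm2 = 2.0           # hypothetical silicon area
critical_path_ns = 5.0   # hypothetical critical path
eta_classic = architectural_efficiency([area_mm2, critical_path_ns], [1.0, 1.0])

# Giving the area a weight of 2 penalizes a large design more strongly.
eta_area_heavy = architectural_efficiency([area_mm2, critical_path_ns], [2.0, 1.0])
```

Only ratios of such values between architectural alternatives are meaningful, since the absolute number depends on the chosen units and weights.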



3.1.2 Characteristics of the Design Methodology 

An ideal design methodology achieves the highest possible architectural
efficiency (ref. to Subsection 3.1.1) in zero design time. For practical
reasons, a feasible trade-off between these parameters has to be
selected. The following characteristics of a design methodology are the 
degrees of freedom to control this trade-off during the design phase: 

• The modeling style for a given design task has important effects 
on the design time and the ability to reuse and verify a design. 

- The level of abstraction has to be reasonably selected in a 
hierarchically organized design to reduce the amount of ir- 
relevant details for the current design task - this hierarchical 
organization can be viewed as a vertical partitioning of design 
tasks. 

- Modularization or horizontal partitioning of design tasks, on 
the other hand, reduces the design time in combination with 
concurrent engineering (see below). 

• Design automation in combination with abstraction and appropriate
tool support for both design and verification reduces
the risk of design errors and speeds up the design process.

• Debugging on all levels of abstraction should be facilitated by a 
transparent design methodology and appropriate modeling styles. 

• Design reuse is also a means to reduce design time. Design reuse 
has to be performed wherever it is possible to take advantage of 
encapsulated, verified modules enabling higher abstraction levels 
[216].

• Process organization is the mapping of required design tasks to 
the available human resources. A vertical specialization of work 
according to the level of abstraction in the design flow can be used 
together with overlapping execution to reduce the risk of design 
flaws and to parallelize and speed up the design flow. On the other 
hand, horizontal partitioning of the work results in reduced design 
time due to concurrent engineering as well. Typically, a combination
of these two approaches is used depending on the complexity
of each design task. However, an overly fine-grained partitioning of
design tasks can lead to inefficiencies due to communication
overhead.
• Monitoring of the design process and project management is 
mandatory to adaptively control the process organization in order 
to meet the deadlines and to identify problems at an early stage. 

The design efficiency η_design can be defined using the architectural
efficiency η_arch of Subsection 3.1.1 together with the overall design effort
T_design (in man months) and the weight factor w_design as follows:



    \eta_{design} = \left( \frac{1}{T_{design}} \right)^{w_{design}} \cdot \eta_{arch}    (3.2)



The design efficiency η_design can be used to compare different design
approaches and methodologies and to detect problems in the design
flow a posteriori. However, as mentioned before, the number of different
parameters that affect the design efficiency is large and for practical 
analysis of design issues, a more thorough investigation of the design 
flow and the application is needed. 

A different approach to evaluate different design methodologies and im- 
plementations could focus on monetary costs of chip design and chip 
production. This evaluation model could be easily set up using the costs 
for designing, prototyping as well as the production costs per chip and 
the market volume. A more thorough investigation of design costs is 
given in [121], which focuses on the design of dedicated hardware. 



3.2 Basics of Low-Energy Hardware Design 

VLSI design for reducing the energy consumption of a device basically 
involves two different issues: estimation of power consumption and 
techniques to reduce the power or energy consumption. For two differ- 
ent reasons, a reduction of power or energy consumption is beneficial: 
reduction of the peak power is needed to avoid problems with voltage 
drops and ground bouncing within a chip. On the other hand, reduction 
of the average power is mostly driven by mobile systems to increase the 
battery lifetime, but also for consumer applications, where device costs 
due to expensive packages, heat sinks and power supplies are of
significance. Furthermore, environmental concerns have triggered low-power
initiatives like the Energy Star program [259]. Finally, the reliability
of a system is increased by lowering the average power due to reduced 
thermal stress and reduced electromigration [191]. 

In the following subsections the sources for digital CMOS energy con- 
sumption, the energy estimation approaches and techniques to reduce 
the energy and power consumption are discussed. 



3.2.1 Sources of CMOS Energy Consumption 

The total power consumption P_total of a CMOS circuit with the supply
voltage V_dd can be described by the following equation:



    P_{total} = I_{standby} V_{dd} + I_{leakage} V_{dd} + I_{sc} V_{dd} + \alpha_{avg} C_L V_{dd}^2 f_{clk}    (3.3)
              = P_{standby} + P_{leakage} + P_{short\_circuit} + P_{capacitive}



The standby current I_standby is typically completely avoided by a
proper CMOS circuit style and can usually be neglected. However, for
certain circuit styles (pseudo NMOS, NMOS pass transistor logic, and
memory cores) I_standby can be an issue [197].

The leakage current I_leakage is due to the reverse bias current in the
parasitic diodes of the diffusion zones and the bulk region of the MOS 
transistors and also due to the subthreshold current in the case of gate 
voltages below the threshold voltage. This effect is an issue in the case 
of current and future technologies with significantly reduced power sup- 
ply voltages. 

The short-circuit power P_short_circuit is due to the hot path in typical
CMOS circuits (like the inverter in Figure 3.1) when both transistors are 
on for a short period of time during transitions. This term depends on 
the input rise (fall) time (slew rate), the output load and the transistor 
sizes (internal capacitances and gain factors). 

Figure 3.1: Short-Circuit Current in a CMOS Inverter

The capacitive power P_capacitive depends on the average switching
probability α_avg, the clock frequency f_clk and the switched capacitance
C_L. It is caused by the power needed to charge and discharge (in most
cases parasitic) capacitances on the chip. The switching probability
α_node (also referred to as toggle activity or as toggle probability) of a
single node is defined as the ratio of the number of transitions of the
considered logic node to the number of clock transitions within the
simulation interval. For a strictly synchronous design style using
positive (negative) clock edge triggered flip-flops, a logic node transition
can only occur after the rising (falling) edge of the clock; thus, the
maximum transition probability for a logic node without glitches is α = 1/2.
The transition probability of the clock itself is α = 1. The capacitance
C_L in Equation 3.3 is the sum of all node capacitances in the considered
circuit, whereas α_avg is the equivalent average transition probability
of C_L according to the following equation:



    \alpha_{avg} = \frac{\sum_{i=1}^{N} \alpha_{node,i} C_{node,i}}{\sum_{i=1}^{N} C_{node,i}} = \frac{\sum_{i=1}^{N} \alpha_{node,i} C_{node,i}}{C_L}    (3.4)



where α_node,i and C_node,i are the transition probabilities and the
capacitances, respectively, of node i.
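
The definition of the toggle probability and the capacitance-weighted average of Equation 3.4 can be sketched as follows. The sketch assumes one sampled node value per clock cycle (so glitches are not visible) and two clock transitions per cycle; the function names and the example capacitances are hypothetical.

```python
def toggle_probability(samples, clock_cycles):
    """Node transitions divided by clock transitions (two per clock cycle).

    samples holds the initial node value followed by one value per cycle.
    """
    transitions = sum(a != b for a, b in zip(samples, samples[1:]))
    return transitions / (2 * clock_cycles)


def average_toggle_probability(alphas, capacitances):
    """Capacitance-weighted mean of the per-node toggle probabilities."""
    c_total = sum(capacitances)  # corresponds to C_L
    return sum(a * c for a, c in zip(alphas, capacitances)) / c_total


# A node that toggles in every cycle reaches the glitch-free maximum of 0.5.
alpha_busy = toggle_probability([0, 1, 0, 1, 0], clock_cycles=4)
# A node that toggles once in four cycles.
alpha_idle = toggle_probability([0, 0, 1, 1, 1], clock_cycles=4)

# Hypothetical node capacitances in farad: the lightly loaded node is the busy one.
alpha_avg = average_toggle_probability([0.5, 0.1], [10e-15, 30e-15])
```

In this example the heavily loaded but rarely toggling node dominates the average, which is why α_avg (0.2) lies much closer to 0.1 than to 0.5.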
The term static power consumption refers to the sum of P_standby and
P_leakage, which typically represent negligible constants for the majority
of current CMOS designs.

The sum of P_short_circuit and P_capacitive is referred to as dynamic power,
which represents the most significant part of the overall power budget
for state-of-the-art CMOS designs. For practical purposes it is
useful to replace the short-circuit current I_sc with an equivalent
short-circuit capacitance C_node,sc, because the amount of short-circuit power
P_short_circuit is also proportional to the switching activity of the circuit
nodes. If C_node,sc is chosen according to the following equation

    C_{node,sc} = \frac{I_{node,sc}}{\alpha_{node} V_{dd} f_{clk}}    (3.5)



then the effective dynamic capacitance C_dyn_node,i of a node i can be
expressed as follows:

    C_{dyn\_node,i} = C_{node,i} + C_{node,sc,i}    (3.6)



Together with

    C_{dyn} = \sum_{i=1}^{N} C_{dyn\_node,i}    (3.7)

and

    \alpha_{avg\_dyn} = \frac{\sum_{i=1}^{N} \alpha_{node,i} C_{dyn\_node,i}}{C_{dyn}}    (3.8)



Equation 3.3 can be simplified to

    P_{total} = I_{standby} V_{dd} + I_{leakage} V_{dd} + \alpha_{avg\_dyn} C_{dyn} V_{dd}^2 f_{clk}    (3.9)
              = P_{standby} + P_{leakage} + P_{dynamic}

which states more clearly the significant physical effect of the average
dynamic toggle probability α_avg_dyn on power consumption.
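
As a plausibility check, the following sketch folds the short-circuit current of each node into an equivalent capacitance and evaluates the dynamic term α_avg_dyn · C_dyn · V_dd² · f_clk. The result should equal the sum of the capacitive and the short-circuit power computed directly. All node data, the supply voltage and the clock frequency are hypothetical values, and every node is assumed to have a non-zero toggle probability.

```python
def equivalent_sc_capacitance(i_sc, alpha_node, vdd, f_clk):
    """Capacitance that dissipates the same power as the short-circuit current."""
    return i_sc / (alpha_node * vdd * f_clk)


def dynamic_power(nodes, vdd, f_clk):
    """nodes: list of (alpha_node, c_node, i_sc) tuples, one per logic node."""
    c_dyn_nodes = [
        c + equivalent_sc_capacitance(i_sc, alpha, vdd, f_clk)
        for alpha, c, i_sc in nodes
    ]
    c_dyn = sum(c_dyn_nodes)
    alpha_avg_dyn = sum(
        alpha * c_eff for (alpha, _, _), c_eff in zip(nodes, c_dyn_nodes)
    ) / c_dyn
    return alpha_avg_dyn * c_dyn * vdd ** 2 * f_clk


VDD = 1.8        # volt, hypothetical
F_CLK = 100e6    # hertz, hypothetical
NODES = [(0.25, 20e-15, 1e-7), (0.5, 5e-15, 5e-8)]  # (alpha, farad, ampere)

p_dyn = dynamic_power(NODES, VDD, F_CLK)

# Direct evaluation of the capacitive and the short-circuit term for comparison.
p_cap = sum(alpha * c * VDD ** 2 * F_CLK for alpha, c, _ in NODES)
p_sc = sum(i_sc * VDD for _, _, i_sc in NODES)
```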



3.2.2 Basic Principles of Lowering the Power Consumption 

The following considerations directly refer to the total CMOS power
consumption as a goal for minimization. As mentioned before, for
current CMOS technologies the major part of CMOS power in
Equation 3.9 is the dynamic power consumption:

    P_{dynamic} = \alpha_{avg\_dyn} C_{dyn} V_{dd}^2 f_{clk}    (3.10)



The dynamic power consumption can obviously be decreased by 

• reducing the supply voltage V_dd, which results in a quadratic
decrease of P_dynamic

• reducing the clock frequency f_clk

• reducing the effective switched capacitance C_dyn (which includes
the physical node capacitance and the equivalent short-circuit
capacitance)

• reducing the switching probability α_avg_dyn.

A reduction of the supply voltage V_dd increases the combinational
circuit delay T_prop (not the interconnection delay) according to

    T_{prop} \propto \frac{V_{dd}}{(V_{dd} - V_t)^{\alpha}} \quad \text{with } \alpha \in [1.0, 2.0]    (3.11)
which has to be compensated for systems with high throughput, e.g. by
using parallelized or pipelined processing units. Unfortunately,
replication results in a higher power consumption, too, but the quadratic
decrease of power consumption due to voltage scaling outweighs this
increase in many practical cases like in [45]. A different approach is
presented in [71], where the design is partitioned into critical regions
with a small timing slack and uncritical regions with a high timing slack,
using a dual supply voltage without affecting the critical
path. For very small feature sizes the exponent α in Equation 3.11
approaches 1.0 and for V_dd ≫ V_t the delay is nearly a constant, which is
very favorable for voltage scaling.
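
The trade-off between replication and voltage scaling can be made concrete with a small numeric sketch. It assumes the delay model of Equation 3.11 with hypothetical values V_dd = 2.5 V, V_t = 0.5 V and α = 2, and it ignores the overhead of the multiplexing logic that a real parallel implementation needs: two parallel units each run at half the clock rate, so the supply voltage can be lowered until the delay has doubled.

```python
def relative_delay(vdd, vt, alpha):
    """Combinational delay up to a constant factor: Vdd / (Vdd - Vt)**alpha."""
    return vdd / (vdd - vt) ** alpha


def scaled_voltage(vdd_nom, vt, alpha, slowdown):
    """Lowest supply voltage whose delay is at most slowdown times the nominal one.

    Uses bisection; the delay decreases monotonically with Vdd for alpha >= 1.
    """
    target = slowdown * relative_delay(vdd_nom, vt, alpha)
    lo, hi = vt + 1e-9, vdd_nom
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if relative_delay(mid, vt, alpha) > target:
            lo = mid   # still too slow, the voltage must be higher
        else:
            hi = mid   # fast enough, try a lower voltage
    return hi


VDD_NOM, VT, ALPHA = 2.5, 0.5, 2.0   # hypothetical technology parameters

vdd_parallel = scaled_voltage(VDD_NOM, VT, ALPHA, slowdown=2.0)

# Two units double the switched capacitance, each runs at half the clock rate:
# P_parallel / P_nominal = (2 * C) * (f / 2) * V_parallel**2 / (C * f * V_nom**2)
power_ratio = (vdd_parallel / VDD_NOM) ** 2
```

Under these assumptions the supply voltage drops to roughly 1.65 V and the dynamic power to about 43% of the original value at unchanged throughput.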

However, voltage scaling is limited by technological parameters such as 
subthreshold leakage current, leakage power and reliability issues due 
to signal integrity [204] [72]. 

A reduction of the clock frequency f_clk without further changes
obviously results in a proportional reduction of the computational
performance for synchronous circuits as well as in a proportional reduction of
power consumption. This reduction is limited by the minimum
computational performance required by an application. The total energy to
perform a given computational task is unaffected by a reduction of f_clk,
if no additional changes are applied.
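
The last statement follows directly from the expression for the dynamic power. As a short worked derivation, assume that the task needs a fixed number of clock cycles N_cycles (a symbol introduced here for illustration only) and that the static power is negligible:

```latex
E_{task} = P_{dynamic} \cdot T_{task}
         = \alpha_{avg\_dyn} C_{dyn} V_{dd}^2 f_{clk} \cdot \frac{N_{cycles}}{f_{clk}}
         = \alpha_{avg\_dyn} C_{dyn} V_{dd}^2 N_{cycles}
```

The clock frequency cancels out, so halving f_clk halves the power but doubles the execution time. If static power is not negligible, a longer execution time increases the energy per task.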

A reduction of the effective switched capacitance C_dyn potentially
results in a higher computational performance because of reduced
interconnection and transition delays. Typically, this reduction has to be
achieved using different hardware architectures or technologies.
However, the minimization of C_dyn is limited by architectural bounds (due
to the minimum required interconnection structure) and technological
bounds (due to high interconnection capacitances and unavoidable
short-circuit currents). Nevertheless, the logic designer can reduce C_dyn
on the architectural level by using local instead of global
interconnections, e.g. with systolic arrays or clustered arithmetic using segmented
communication buses.

A reduction of the toggle activity α_avg_dyn is typically also achieved
with optimized hardware architectures. This reduction is especially
effective if applied to logic nodes with a high node capacitance
C_dyn_node,i, e.g. highly loaded chip pads, interconnection buses etc.
However, this reduction is limited by information theoretical bounds
(due to the minimum needed communication resulting in toggle activity
as a function of the signal entropies [258] [226]). It is actually
possible to exploit data redundancy (e.g. data correlation, non-uniform data
distribution etc.) to reduce the toggle activity on certain nodes.
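
One commonly cited illustration of exploiting data correlation, which is not taken from this text, is Gray coding of sequential addresses: consecutive values are highly correlated, and a Gray code changes exactly one bit per increment. The sketch below counts the bit toggles on a hypothetical 4-bit address bus for both encodings.

```python
def bit_toggles(words):
    """Total number of bit flips between consecutive words on a bus."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))


def to_gray(value):
    """Binary-reflected Gray code of a non-negative integer."""
    return value ^ (value >> 1)


addresses = list(range(16))  # sequential accesses on a 4-bit address bus

binary_toggles = bit_toggles(addresses)
gray_toggles = bit_toggles([to_gray(a) for a in addresses])
```

For this access pattern the binary encoding causes 26 bit toggles and the Gray encoding 15. The saving disappears for uncorrelated (random) addresses, which is consistent with the entropy bound mentioned above.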



For power and energy critical applications like embedded µ-processors,
all of the above-mentioned parameters are optimized for state-of-the-art
devices as described in [14]. However, for semi-custom devices
the pressure for low-power design techniques is obviously not yet as
critical. Typically, the supply voltage for semi-custom technologies is
restricted to a certain range [V_min, V_max]. The relevant electrical
parameters, the power consumption and the delay for cells and
interconnections of the target technology are available only for predefined
working conditions, which include defined sets of values for the supply
voltage, the operating temperature and the quality of the fabrication
process. Beyond these working conditions, correct functionality of the
chip is not guaranteed by the foundry. This makes it impossible for a
conservative designer to use voltage scaling beyond V_min. Nevertheless,
for aggressive low-power applications, the usable voltage range might be
extended below V_min.



For semi-custom design, the remaining degrees of freedom, namely 
clock frequency, switching activity and capacitance reduction have to 
be simultaneously optimized to maximize the power savings. 



3.2.3 Measuring and Quantifying Energy-Efficiency 

In order to obtain precise values of the power or energy consumption, 
appropriate analysis techniques are necessary. Power analysis
techniques for semi-custom chips can take advantage of the different
levels of abstraction, namely the circuit level, the cell level and the RTL level.
The layout extraction of each standard cell performed by the technology 
vendor results in an equivalent schematic using resistors, capacitors, in- 
ductors and current/voltage sources. This schematic or parameterized 
model can then be simulated with SPICE [182] or similar analysis tools 
using transient analysis to obtain precise estimates for the switching 
power of the cell. These simulations have to be performed for all the 
defined working conditions and are often calibrated with measurement 
data from actual test chips. After this process, the simulated and mea- 
sured values can be used as library data for power analysis steps at the 
cell level. 

At the cell level, so-called gate-level simulations of the synthesized 
netlist of standard cells can be performed. In order to get more pre- 
cise estimations for the interconnection capacitances, extracted values 
of the design layout can be used. If these are unavailable, wire-load 
models that represent worst case scenarios for synthesis have to replace 
extracted capacitance values [93].

The gate-level simulations have to use a sufficient number of input stim- 
uli in order to get meaningful estimations. This leads to a considerable 
simulation effort for larger designs. 

A complementary approach to cell-level analysis is offered by probabilistic
power estimation techniques, which use statistical properties to describe
the behavior of signals. There are several approaches for a statistical
description of logic signals:

• using static probabilities for the state logic zero (one) 

• using the transition probability under the assumption of a memo- 
ryless logic signal 

• using two different transition probabilities as a function of the cur- 
rent state of the signal (which can be associated to a memory of 
length one) which is referred to as lag-one signal model [281] 

• using a lag-N signal model 

• using the static probability together with a lag-zero, lag-one or a 
lag-N signal model (with increasing computational complexity) 

A signal that is described by one of the statistical properties mentioned 
above can be used to propagate the statistical properties of the inputs 
through combinational logic yielding the statistical properties of the 
output(s). 

Given a Boolean function y = f(x_1, x_2, ..., x_n) and the static probability
(for logic 1 without loss of generality) P(x_i) as well as the (lag-zero)
transition density D(x_i), the statistical properties of y can be calculated
using Shannon’s decomposition of the function f, which is



    y = x_i \cdot f(x_i = 1) + \overline{x_i} \cdot f(x_i = 0)    (3.12)

The static probability of this decomposed representation yields

    P(y) = P(x_i) P(f(x_i = 1)) + (1 - P(x_i)) P(f(x_i = 0))    (3.13)

This decomposition can be recursively evaluated until the function f is
completely decomposed. 
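
The recursive evaluation can be sketched in a few lines of Python. The sketch assumes statistically independent inputs and represents the Boolean function as an ordinary Python callable; both the function name and this representation are illustrative choices, and the run time grows exponentially with the number of inputs.

```python
def static_probability(f, probs, assigned=()):
    """P(f = 1) by recursive Shannon decomposition.

    probs[i] is P(x_i = 1); the inputs are assumed to be independent.
    assigned holds the values already fixed for x_0 .. x_(i-1).
    """
    i = len(assigned)
    if i == len(probs):
        # All inputs are fixed: f is now the constant 0 or 1.
        return 1.0 if f(*assigned) else 0.0
    p_cofactor_1 = static_probability(f, probs, assigned + (1,))
    p_cofactor_0 = static_probability(f, probs, assigned + (0,))
    return probs[i] * p_cofactor_1 + (1.0 - probs[i]) * p_cofactor_0


p_and = static_probability(lambda a, b: a & b, [0.5, 0.5])
p_xor = static_probability(lambda a, b: a ^ b, [0.5, 0.5])
p_and_or = static_probability(lambda a, b, c: (a & b) | c, [0.5, 0.5, 0.5])
```

With equiprobable inputs this gives 0.25 for the AND gate, 0.5 for the XOR gate and 0.625 for (a AND b) OR c.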

A similar approach can be made for the transition density D(y): a
transition of y as a response to a change of x_i occurs if f(x_i = 0) ≠ f(x_i = 1).
This condition, which is also called the Boolean difference of y w.r.t.
x_i, can be expressed as an exclusive-OR of the two functions:



    \frac{\partial y}{\partial x_i} = f(x_i = 1) \oplus f(x_i = 0) = 1    (3.14)



The probability of a transition of y due to a transition of x_i is given by
the product of the static probability for which 3.14 is valid and the
transition density D(x_i). Iterative application of this formula yields

    D(y) = \sum_{i=1}^{n} P\left( \frac{\partial y}{\partial x_i} \right) D(x_i)    (3.15)



This propagation of statistical signal properties is used e.g. by commer- 
cial tools like Synopsys’ DesignCompiler [235] for internal nodes that 
have not been annotated with static probability and switching activity. 
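
The propagation of transition densities can be sketched as follows. The sketch assumes that the output density is the sum over all inputs of the static probability of the Boolean difference multiplied by the input transition density, that the inputs are independent, and that at most one input changes at a time. It is a brute-force illustration and not the algorithm of any particular tool; all names are hypothetical.

```python
from itertools import product


def probability(g, probs):
    """P(g = 1) for independent inputs, by enumerating all input combinations."""
    total = 0.0
    for bits in product((0, 1), repeat=len(probs)):
        if g(*bits):
            weight = 1.0
            for bit, p in zip(bits, probs):
                weight *= p if bit else 1.0 - p
            total += weight
    return total


def boolean_difference(f, i):
    """Function that is true where toggling input i toggles the output of f."""
    def difference(*bits):
        with_one = bits[:i] + (1,) + bits[i + 1:]
        with_zero = bits[:i] + (0,) + bits[i + 1:]
        return bool(f(*with_one)) != bool(f(*with_zero))
    return difference


def transition_density(f, probs, densities):
    """Sum over all inputs of P(Boolean difference w.r.t. x_i) * D(x_i)."""
    return sum(
        probability(boolean_difference(f, i), probs) * d
        for i, d in enumerate(densities)
    )


d_and = transition_density(lambda a, b: a & b, [0.5, 0.5], [0.2, 0.2])
d_xor = transition_density(lambda a, b: a ^ b, [0.5, 0.5], [0.2, 0.2])
```

For the AND gate an input transition is only visible at the output while the other input is 1, so the output density (0.2) is half the sum of the input densities; the XOR gate propagates every input transition (0.4).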
The purpose of low-power or low-energy design techniques is to find an 
architecture for a given application or a set of applications that repre- 
sents the optimum concerning the “power or energy consumption”. For 
a reasonable comparison of different architectures concerning power or 
energy consumption, several metrics can be used: 

• plain power in mW to describe e.g. the average or peak power of 
an architecture 

• plain energy in mWs or mJ which describes the (average) energy
consumption of a given architecture to perform a given application 

An additional metric which is often used for (circuit-level) VLSI de- 
sign is the power delay product. This metric can be viewed as energy 
per computation, which expresses the energy-efficiency of an imple- 
mentation for a given task. The result of the considered computation is 
available after the delay, which is part of this metric, and the total en- 
ergy of this computation is the average power times the delay. Another 
interpretation of this metric makes sense, if voltage scaling or other 
techniques are used that have an impact on power and delay: reduced 
voltage reduces the power quadratically whereas the delay is typically 
(nonlinearly, refer to Equation 3.11) increased. This metric compen- 
sates the effect of the power decrease with the effect of the delay in- 
crease to get an equal weight both for power and delay. A non-equal 
weight is included in the energy delay product [114], which is equal to 
a “power-delay-delay” product; this metric is useful for applications that
favor processing speed over energy consumption. 
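
The different weighting of the two metrics can be seen in a small voltage sweep. The sketch uses normalized units and hypothetical parameters (V_t = 0.5 V, α = 2 in the delay model of Equation 3.11) and assumes that the energy per computation scales with V_dd².

```python
V_T = 0.5      # hypothetical threshold voltage in volt
ALPHA = 2.0    # hypothetical exponent of the delay model


def delay(vdd):
    """Normalized delay: Vdd / (Vdd - Vt)**alpha."""
    return vdd / (vdd - V_T) ** ALPHA


def energy(vdd):
    """Normalized energy per computation (power delay product)."""
    return vdd ** 2


def energy_delay_product(vdd):
    return energy(vdd) * delay(vdd)


voltages = [0.6 + 0.01 * k for k in range(241)]   # 0.60 V to 3.00 V

best_for_energy = min(voltages, key=energy)
best_for_edp = min(voltages, key=energy_delay_product)
```

The power delay product simply favors the lowest voltage of the sweep, regardless of the resulting delay. The energy delay product has its minimum at an intermediate voltage, here close to 3 · V_t = 1.5 V, because the delay penalty of a low supply voltage is weighted more heavily.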

Various other metrics have been proposed and are used for different
purposes, specifically for µ-processors, where the application is not fixed:

• Million instructions per second per mW (MIPS/mW) can be used to
compare different µ-processor implementations that have the same or a very
similar instruction set with a typical benchmark application

• other metrics use operations instead of instructions e.g. [37] uses 
power per throughput (in mW per operations per second) and en- 
ergy per throughput for fixed throughput and maximum through- 
put operation as well as a metric, which normalizes total energy 
consumption to the maximum throughput scenario 
The choice of the power metric strongly depends on the optimization 
goal. For many portable applications with fixed processing rates (which 
corresponds to fixed throughput) constrained by the application (e.g. 
speech, mobile reception/transmission etc.) the metric simply has to 
maximize the battery lifetime. In such a case a metric energy per com- 
putational task or energy per typical set of operations is suitable, which 
has to be interpreted as the above-mentioned energy per operation for a 
given benchmark. For many non-battery operated appliances, the metric 
average power is often sufficient in order to keep package and cooling 
costs under control. However, a reduction of the average consumed 
energy leads to a reduction of the average power and vice versa. There- 
fore, these distinct metrics can be indirectly optimized simultaneously 
by just considering the average energy per computational task. In the 
following discussions and also in Chapter 7 the metric average energy 
per computational task is used as optimization goal and the terms power 
and energy optimization are used as synonyms. 



3.3 Techniques to Reduce the Energy Consumption 

The focus of this section is to characterize techniques to optimize the 
energy per computational task starting with a high level behavioral im- 
plementation of the algorithm and ending with a synthesized netlist of 
standard cells. Figure 3.2 depicts the different levels of abstraction: the 
impact of power saving techniques decreases with increasing level of 
implementation detail, whereas the accuracy of power estimation in- 
creases. This is an issue that makes it difficult to predict the effect of 
e.g. algorithmic changes on the power consumption. 

It has to be mentioned that the following classification into
system/architecture level, logic level and physical level is not orthogonal
for all low-power techniques: there are techniques which affect more than one
level in this hierarchy. This fact emphasizes the importance of joint
power optimization on all levels of abstraction.

Furthermore, the techniques described in this section are typically
restricted to semi-custom design flows - special circuit techniques
enabled by full-custom design are not covered. An exception is the
important technique of voltage scaling, which is not commonly used for
current semi-custom design flows. However, this technique might
become important in the future and is therefore included in the following
discussion.

[Figure 3.2 lists the abstraction levels algorithm, system, architecture,
gate, circuit and physical design; from top to bottom the achievable power
savings go from most to least, the analysis resources from least to most,
and the analysis accuracy from worst to best.]

Figure 3.2: Level of Abstraction vs. Possible Savings (Irwin [124])



3.3.1 System and Architecture Level 

Given a certain application, the first choice in the design flow is the se- 
lection of an optimum algorithm with respect to the cost function of the 
design. The term cost depends on the application and typically includes 
the number of operations (additions, multiplications, logic operations 
etc.), the number of memory accesses as well as the memory size. The 
decisions on this level of abstraction typically have a large impact on the 
design efficiency. Unfortunately, the estimates on this level of
abstraction tend to be significantly inaccurate unless a complete
implementation in a high level language is available and a complete floating point to
fixed point conversion has been done. After this design step (which has 
to be performed for all considered algorithms) more precise values for 
the complexity of memories, arithmetic, and logical operators can be 
estimated. The traditional purpose of this algorithm optimization is the 
reduction of operations, memory accesses and memory size in order 
to reduce the area and to increase the computational throughput of an
implementation. Obviously, this optimization also significantly reduces
the energy consumption of the final implementation. The scheduling 
of operations also has an impact on the power consumption. In [44] 
an example is presented, which exploits the associativity and commu- 
tativity of addition by reordering the data flow graph and adding the 
smaller operands first. For this example a saving of about 30% in power 
consumption is reported. 

After the algorithm optimization and selection is finished, the partition- 
ing into building blocks - dedicated hardware, configurable HW blocks, 
or programmable devices - has to take place. This partitioning must find 
a feasible solution with respect to processing power and data rates (for 
case studies refer to e.g. [29] or [148]). Excessive flexibility has to be 
restricted to the required minimum in order to avoid an unnecessary
increase in power consumption [1] [92]. Thus, it is important to identify
whether the amount of required flexibility of a building block can be
satisfied with (coarse-grain configurable) dedicated hardware. An
important parameter for this partitioning is obviously the computational
performance of the considered task. Moreover, parameters like area
efficiency for low data-rate tasks and flexibility requirements for
error-prone and quickly changing control tasks have to be taken into
account [91].

It is possible for some algorithms to use adaptive implementations, 
where the number of operations that are needed for this task can be 
scaled to reduce energy. This typically also affects the algorithmic per- 
formance (e.g. bit error rate, mean square error etc.). However, if the 
application permits a certain algorithmic degradation under some cir- 
cumstances, it might be advantageous to detect this condition and scale 
the algorithm accordingly. Theoretically, any iterative algorithm is a 
candidate for this saving technique provided that the overhead of esti- 
mating the scaling criterion is small compared to the expected savings. 
One application of such a technique is the adaptive filtering in [172]
[144], where the signal-to-noise ratio is estimated and used to adapt the
filter length of the FIR filter. Another example monitors and controls the 
progress of iterative matrix diagonalization by low overhead techniques 
[147]. 
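
As a minimal sketch of such an adaptive implementation, the following
Python fragment scales the number of FIR taps with an estimated
signal-to-noise ratio. The function names, the thresholds and the linear
mapping are assumptions made for illustration, as is the choice that a
higher SNR permits a shorter filter; none of this is taken from [172]
or [144].

```python
def choose_filter_length(snr_db, min_taps=8, max_taps=64,
                         snr_low_db=5.0, snr_high_db=25.0):
    """Map an SNR estimate to a number of FIR taps (hypothetical linear rule)."""
    if snr_db >= snr_high_db:
        return min_taps          # good channel: shortest filter, least energy
    if snr_db <= snr_low_db:
        return max_taps          # poor channel: full filter length
    frac = (snr_high_db - snr_db) / (snr_high_db - snr_low_db)
    return min_taps + round(frac * (max_taps - min_taps))


def fir_output(coeffs, history, taps):
    """Compute one output sample using only the first `taps` coefficients."""
    return sum(c * x for c, x in zip(coeffs[:taps], history[:taps]))
```

The number of multiply-accumulate operations per output sample equals
the selected number of taps, so the scaling criterion directly controls
the operation count.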

An often-used technique for lowering the power consumption is voltage
scaling, typically in combination with parallelization of hardware
units. This technique has already been introduced in Section 3.2.2. If the
initial algorithm is easily parallelizable or pipelinable, this technique is 
straightforward. However, many algorithms are inherently sequential 
due to data dependencies, which makes parallelization more difficult if 
not impossible. In such a case it might be worth changing the algo- 
rithm to an approximation that exhibits higher parallelism like in the 
well-known case of Turbo decoders [173]. On the other hand, the se- 
quential description of an algorithm can be modified without changing 
the output behavior of the algorithm in order to exploit more parallel 
operations e.g. by loop unrolling [45]. 
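
The effect of loop unrolling can be sketched with a simple dot product.
The example below is an illustration only (the function names are
hypothetical): the unrolled version keeps two independent accumulators,
so that two multiply-accumulate operations per iteration could be mapped
to parallel hardware units. Integer (fixed point) data is assumed, for
which the reordering of the additions does not change the result.

```python
def dot_sequential(a, b):
    """Reference version: one multiply-accumulate per loop iteration."""
    acc = 0
    for i in range(len(a)):
        acc += a[i] * b[i]
    return acc


def dot_unrolled2(a, b):
    """Unrolled by two: the two accumulations per iteration are independent."""
    acc0 = 0
    acc1 = 0
    n = len(a)
    for i in range(0, n - 1, 2):
        acc0 += a[i] * b[i]          # could run on functional unit 0
        acc1 += a[i + 1] * b[i + 1]  # could run on functional unit 1
    if n % 2:                        # leftover element for odd lengths
        acc0 += a[n - 1] * b[n - 1]
    return acc0 + acc1
```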

Another approach for tasks with low or non-existent throughput con- 
straints is the reduction of the supply voltage without a change in im- 
plementation tolerating a certain degradation of computational perfor- 
mance. This technique has been used in [206] to scale the voltage 
dynamically for a microprocessor system by using a power-conscious 
operating system. However, for many DSP applications with fixed 
throughput requirements this approach is infeasible. 

Memory accesses are expensive in terms of energy consumption, be- 
cause heavily loaded internal bit- and word-lines have to be switched. 
The average energy consumption of a memory (read or write) access in- 
creases with the memory size. To make matters worse, external memory 
accesses require switching even higher pad and external capacitances. 
Therefore, algorithmic transformations in order to reduce the number 
of memory accesses and/or to reduce the memory size are also effective 
power saving techniques on the system level (cf. [85], [181] and [42] for 
examples). Accesses to large memories should be reduced by using an 
appropriate memory hierarchy: starting with registers as the lowest level 
of hierarchy, this hierarchy ends with large on-chip or external memory 
banks. Accesses to registers are obviously much less power consuming 
(and typically also much faster) than accesses to larger memory blocks. 
Favoring local over global communication, as in this example, decreases
the power consumption.
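
The benefit of a memory hierarchy can be estimated with a simple access
count model. The relative energy values in the following sketch are
purely hypothetical placeholders; real values depend on the technology
and on the memory sizes.

```python
# Hypothetical relative energy per access; not measured data.
ENERGY_PER_ACCESS = {"register": 1.0, "on_chip_sram": 10.0, "external": 100.0}


def access_energy(counts, energy=ENERGY_PER_ACCESS):
    """Sum the access energy for a mapping of memory level to access count."""
    return sum(energy[level] * n for level, n in counts.items())


# The same 1000 accesses, once served only by external memory and once
# distributed over a hierarchy.
flat = access_energy({"external": 1000})
hierarchical = access_energy(
    {"external": 100, "on_chip_sram": 300, "register": 600})
```

Under these assumed values the hierarchical mapping needs a fraction of
the energy of the flat mapping, because most accesses are served locally.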

A well-known technique for many µ-processors is applicable to any
kind of low-power hardware: power management, which shuts down
inactive parts of the chip. This can be done on different levels, e.g.
by gating the clock for unused parts of the circuit (which is automated
by logic synthesis tools like the DesignCompiler [235]) or even by
entirely shutting down the clock generation unit in the phase-locked
loop like in [73]. Power management for complete modules on the chip
requires either software support, e.g. provided by a power-conscious
operating system, or a dedicated hardware controller.

For programmable architectures, the overhead in terms of instruction 
fetching, instruction decoding, data routing etc. can be reduced by in- 
creasing the number of useful operations per time unit without increas- 
ing the overhead energy. In [37] it is stated that VLIW architectures 
are the best candidate for this optimization because they exploit instruc- 
tion level parallelism (ILP). However, real VLIW implementations tend 
to increase the overhead energy significantly due to larger instruction 
memories and decoders. This disadvantage can be partially reduced 
by instruction compression techniques that reduce the instruction memory
width by avoiding the explicit coding of no-operation opcodes in
the instruction. Furthermore, more elaborate compression schemes reduce
the redundancy of programs by exploiting statistical properties.
Examples for instruction compression techniques are the simple fetch 
scheme of the commercially available TMS320C62xx [246] and also 
more elaborate techniques used in academia like in [163]. The simple 
scheme of the TMS320C62xx, however, results in several power-consuming
decoding stages, which are needed to decode and route instructions
to functional units. On the other hand, more elaborate compression
schemes typically require significant hardware effort and energy to
decompress the code due to large look-up tables. A completely orthogonal
technique has been used in the case study of Chapter 7.1, where
application-specific instructions have been implemented to increase the
number of parallel operations per instruction without a significant
impact on the overhead energy.
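
The idea of avoiding explicitly coded no-operation opcodes can be
sketched as follows. This is a generic illustration with hypothetical
names and a simple slot mask; it is not the fetch scheme of the
TMS320C62xx or of [163].

```python
NOP = "nop"


def compress_bundle(bundle):
    """Return (mask, ops); bit i of mask is set if slot i holds a real operation."""
    mask = 0
    ops = []
    for slot, op in enumerate(bundle):
        if op != NOP:
            mask |= 1 << slot
            ops.append(op)
    return mask, ops


def decompress_bundle(mask, ops, num_slots):
    """Rebuild the full-width instruction word, re-inserting the NOPs."""
    remaining = iter(ops)
    return [next(remaining) if (mask >> slot) & 1 else NOP
            for slot in range(num_slots)]
```

Only the mask and the non-NOP operations have to be stored in the
instruction memory, while the decoder needs additional logic to route
each operation back to its slot.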



3.3.2 Register Transfer and Logic Level 

Low-power techniques on the register transfer (RTL) and on the logic 
level can be subdivided into techniques for lowering the capacitance and 
the switched voltage as well as into techniques to reduce the toggle rate 
of nodes with a high relative capacitance. Furthermore, toggle activity 
for useless calculations should be reduced to a minimum. Reduction
of the switched voltage is beyond the scope of this thesis, because
it requires special circuit techniques that are (so far) not applicable to
semi-custom design flows.

Lowering the capacitance can be achieved by reducing or avoiding
global communication as far as possible because global communication
implicitly requires switching long interconnections with high
capacitance. However, for heterogeneous systems using different layout
blocks on a single chip it is often unavoidable to use long
interconnections. In such a case the interconnection network has to be
reduced to a minimum and the topology should favor point-to-point or
nearest-neighbor connections [1]. For the same reason, external
communication should be reduced to a minimum, e.g. by using internal
cache memories. Much of the effort of lowering capacitances is taken
over by the synthesis tool, which implicitly reduces the switched
capacitance by logic optimization targeting minimum area and in many
cases also when targeting maximum speed. Advanced techniques to
explicitly reduce power by optimum technology mapping are reported in
[248].

If the capacitances cannot be further reduced, the orthogonal approach
is to reduce the switching activity of interconnections with high
capacitances. Various approaches have been described in the technical
literature. The most popular technique - clock gating - can be
classified as a so-called guarding technique. Clock gating means shutting
down the clocking for a certain group of registers under a certain guard
condition. An obvious example of this technique is to shut down the
clocking of pipelined functional units in a microprocessor, e.g. in [223].
Clock gating techniques on a more fine-granular level are possible like 
in Figure 3.3 [45], where the input of a comparator is guarded against 
the trivial condition that the MSBs are different to avoid the evaluation 
of the full input word lengths in this case. This special guarding logic 
together with the MSB comparator is also called precomputation logic, 
because the result can be quickly precomputed using a subset of the cir- 
cuit inputs. Furthermore, guarding techniques to avoid propagation of 
data values into functional units that are connected to a common bus 
[91] [251] are extremely efficient, because they can typically be im- 
plemented with minor overhead in area and design effort. So-called ex- 
tended guarding techniques [251] comprise conventional guarding logic 
as well as additional logic that can be viewed as precomputation logic. 
Guarding techniques involve several issues with common standard cell
design flows: Firstly, the use of latches is typically prohibited due to
testability issues. Secondly, the efficiency of guarding techniques
typically depends on the timing constraint that the guarding condition is
stable before the signals that have to be guarded are stable. The latter
relative timing constraint makes timing verification and physical design
significantly more complicated.

Figure 3.3: Gated Comparator
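
The behavior of a comparator with precomputation logic can be modelled
in a few lines. The sketch below assumes unsigned operands and a
"greater than" comparison; the function name and the returned flag,
which indicates whether the lower bits had to be evaluated, are
hypothetical.

```python
def gated_greater_than(a, b, width=8):
    """Return (a > b, lower_bits_evaluated) for unsigned `width`-bit operands."""
    msb_a = (a >> (width - 1)) & 1
    msb_b = (b >> (width - 1)) & 1
    if msb_a != msb_b:
        # Guard condition: the MSBs alone decide the result, so the inputs
        # of the (width - 1)-bit comparator can keep their old values.
        return msb_a == 1, False
    low_mask = (1 << (width - 1)) - 1
    return (a & low_mask) > (b & low_mask), True
```

In hardware, the flag corresponds to the enable signal of the guarded
input registers: the wide comparator only toggles when the MSBs are
equal.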

Pipelining of combinational logic has several effects: firstly, the
critical path of the (synchronous) implementation is shortened, which
enables savings due to voltage scaling or due to slower implementations
of arithmetic operations with a higher energy-efficiency (a comparison
of arithmetic implementations is given in [40]). Secondly, glitches (also
called spurious transitions) within the combinational logic due to
unbalanced signal propagation are reduced, which also results in lower
energy. Unfortunately, pipeline registers in semi-custom technology are
typically extremely costly in terms of area (with an implicit increase of
capacitances due to higher distances on the chip) and, to make matters
worse, increase the clock power (in the clock tree as well as in the
register circuits). This negative effect on power consumption has to be
compensated with clock gating, wherever this is possible. Retiming can
also be used to reduce the area penalty of pipelining to a certain extent
as well as to reduce the switching activity of logic nodes [197].

Figure 3.4: Flattening of Operators and Logic



Further reduction of switching activity on highly capacitive nodes due
to glitching can be achieved by reorganization of logic gates and
operators [45] [142] like in the examples of Figure 3.4. Reorganization
of operators typically has to be performed manually, but reorganization
of logic cells and also reordering of equivalent inputs [281] can be
performed automatically by commercial synthesis tools [64]. The
optimization tasks of the logic synthesis tool can be subdivided into
combinational optimizations like

• don’t care optimization [214] 

• path balancing 

• factorization 

as well as sequential optimization like 

• state encoding 

• retiming 

The data representation itself also has an impact on the switching
activity: in [45] the transition probability for each bit of 16 bit audio
data represented by 2’s complement and by sign-magnitude numbers is
compared. The results (which are obviously data dependent) indicate that,
due to the signal correlation of audio signals, the switching probability
of the higher weighted bits can be significantly reduced by a
sign-magnitude number representation. This is interesting for signals
which have to be transferred over highly capacitive system buses. In
general, multiplexing of uncorrelated data over highly capacitive buses
tends to consume more power than using parallel buses with correlated
signals [45]. This obviously represents an area-energy tradeoff. Similar
statements have been made about using resource sharing with uncorrelated
data streams. In [83] different encoding techniques for address and data
buses have been evaluated. It has been shown that these techniques
heavily depend on the statistical properties of the transmitted data.
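
The influence of the number representation can be reproduced with a
small toggle counter. The following sketch uses hypothetical helper
names and counts the total number of bit transitions between successive
16 bit words instead of the per-bit transition probability evaluated in
[45].

```python
def twos_complement(value, bits=16):
    """Encode a signed integer as a `bits`-wide 2's complement word."""
    return value & ((1 << bits) - 1)


def sign_magnitude(value, bits=16):
    """Encode a signed integer as sign bit plus (bits - 1)-bit magnitude."""
    sign = 1 if value < 0 else 0
    return (sign << (bits - 1)) | (abs(value) & ((1 << (bits - 1)) - 1))


def count_toggles(samples, encode, bits=16):
    """Count the bit transitions between successive encoded samples."""
    words = [encode(s, bits) for s in samples]
    return sum(bin(prev ^ cur).count("1")
               for prev, cur in zip(words, words[1:]))
```

For a small signal that alternates around zero, e.g. the samples
1, -1, 1, -1, the 2’s complement representation toggles 15 bits per
transition, whereas the sign-magnitude representation toggles only the
sign bit.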

Other approaches try to minimize the memory power consumption by 
using runtime compression techniques in combination with intelligent 
memories [185]. 

In Chapter 7.1.3 the effect of minimizing the internal power of an
instruction memory by reducing the number of discharging events in the
instruction ROM is described. This minimization has been performed
automatically using instruction-frequency-driven maximum weight
encoding*. The tools that have been developed for this optimization (refer
to Subsection 6.3.1 for details) can also reduce the switching activity of
the (external) instruction bus, if this is desired. A more limited approach
is described in [273], where the don’t care bits in a microprogrammed
control unit are optimally assigned using trace-driven activity
evaluations.
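
A frequency-driven maximum weight encoding can be sketched as follows.
The sketch rests on two assumptions that are not taken from the text:
every 0 bit read from the ROM causes a discharging event, and the
encoding is restricted to a plain opcode field. The function names and
the instruction mix are hypothetical; the tools described in Subsection
6.3.1 may work differently.

```python
def max_weight_encoding(frequencies, opcode_bits):
    """Assign the opcodes with the most 1 bits to the most frequent instructions."""
    if len(frequencies) > (1 << opcode_bits):
        raise ValueError("not enough opcodes for all instructions")
    # Opcodes ordered by decreasing Hamming weight (ties: smaller code first).
    codes = sorted(range(1 << opcode_bits),
                   key=lambda c: (-bin(c).count("1"), c))
    # Instructions ordered by decreasing frequency (ties: by name).
    ranked = sorted(frequencies, key=lambda name: (-frequencies[name], name))
    return dict(zip(ranked, codes))


def expected_zero_bits(encoding, frequencies, opcode_bits):
    """Average number of 0 bits (assumed discharging events) per fetch."""
    total = sum(frequencies.values())
    zeros = sum(frequencies[name] * (opcode_bits - bin(code).count("1"))
                for name, code in encoding.items())
    return zeros / total
```

For a hypothetical instruction mix of 50 mac, 30 add, 15 ld and 5 br
fetches with a 2 bit opcode, the most frequent instruction receives the
all-ones code and the rarest one the all-zeros code.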



3.3.3 Physical Level 

On this level of abstraction the number of manually guided
optimizations is quite limited because the semi-custom design flow uses
automatic place and route tools in order to transform the netlist of
standard cells into a chip layout. The place and route tools automatically
minimize the wire length (and wire capacitances) according to the timing
constraints. However, this does not necessarily represent the optimum
concerning power consumption, because the switching activity is typically
not taken into account. Automatic gate sizing using in-place
optimization (IPO) with area and I/O standardized buffer cells can be used to
optimize the transition times of logic as well as clock nodes after an
initial place and route pass has been performed.

*This technique has been developed and published in [129] [90].

There are some design tasks which can nevertheless be exploited to save 
power on this level of abstraction: partitioning and back-annotating of 
layout information to the synthesis tool. 

Partitioning and floorplanning for low-power can be done taking into 
account the interconnections between the layout blocks, which are typi- 
cally defined earlier in the design process (normally during architecture 
design). The length of interconnections with high switching frequency 
should obviously be minimized which corresponds to a minimization 
of the distance between the associated layout blocks. The I/O ports of 
physical blocks may have to be manually defined in order to achieve 
optimum interconnections. 

Back-annotating of layout capacitances together with the switching
activity information from gate level simulation to the synthesis tool can
enable efficient reoptimization of logic for low-power. This technique 
has already been described in the previous subsection. 



3.4 Concluding Remarks 

This chapter summarizes the metrics for efficient hardware implemen- 
tation and efficient hardware design. Furthermore, the sources of en- 
ergy consumption of state-of-the-art CMOS technology are described. 
Moreover, a concise summary of low-power design principles as well 
as specific design techniques in order to lower the power or energy con- 
sumption is given. Many of these techniques are used in the ASIP de- 
sign flow described in Chapter 5. 




Chapter 4 



Application-Specific Processor 
Architectures 



The ASIP design space classification presented in this chapter
identifies the degrees of freedom in the ASIP design process. This discussion
neglects low-level hardware implementation details, which have been
treated in the previous chapter; it rather uses the abstraction of word
level hardware operators like addition, multiplication, etc. The design
decisions on this level of abstraction significantly affect the resulting
architectural efficiency as well as the overall design efficiency. As a
consequence, this chapter is a prerequisite for the ASIP design flow
presented in Chapter 5. Furthermore, this classification makes it possible
to decide whether a certain architecture can be supported by an available
high level language compiler design environment or a retargetable HLL
compiler to enable high level language support. Finally, this chapter
treats the important relation between high level design decisions and
critical low level implementation characteristics. This relation has to
be well understood in order to obtain optimum design results.

This chapter starts by defining important terms in the context of ASIP 
design and embedded signal processing architectures. Afterwards, sev- 
eral important fields of ASIP applications are discussed together with 
references to ASIP case studies. Finally, the design space of ASIPs is 
defined and the impact of high level design decisions on performance, 
energy and area consumption is described. 



4.1 Definitions of ASIP Related Terms 

The technical literature uses the acronym ASIP to describe two different
kinds of integrated digital circuits:

• Application-Specific Integrated Processor: This term represents
any kind of application-specific digital integrated circuit used for
data processing and does not imply any kind of instruction set
oriented or programmable data processing [209].

• Application-Specific Instruction Set Processor or Application-Specific
Instruction Processor^: This term represents a programmable
application-specific processor using the concept of an
instruction set architecture for data processing.



In this thesis, the term ASIP exclusively refers to an instruction set ori- 
ented processor with application-specific optimizations including op- 
tional tightly-coupled hardware accelerators. 

Figure 4.1 depicts typical classes of hardware implementation 
paradigms, which are bounded by pure ASICs on the left side and by 
general purpose processors on the right side. ASIPs can be viewed as a 
tradeoff between non-programmable application-specific integrated cir- 
cuits (ASICs) and domain specific signal processors (DSSPs). ASIPs 
are optimized for just one signal processing application. DSSPs are in- 
struction set oriented processors targeting a complete domain of signal 
processing applications (e.g. a network processor optimized for a class 
of different network processing tasks). Conventional off-the-shelf DSPs 
are less application-specific and target an even broader range of signal 
processing applications. 

The term instruction set architecture (ISA) defines the part of a pro- 
cessor that is visible to the programmer or compiler writer [107]. 

The term processor architecture (PA) extends the scope of an ISA by 
adding implementation characteristics that are hidden to the software: 
this chapter discusses PAs on the abstraction level of word parallel hard- 
ware operators. A PA contains a description of the processor resources 
(functional units and storage elements), of the interconnections between 
those resources and of the encoding/behavior of the supported instruc- 
tions. The instruction behavior determines the processor’s state transi- 
tions and the resource utilization of functional units. 



^In the technical literature the term Application-Specific Programmable Processor (ASPP) is a synonym
for an ASIP in this sense, cf. [143]

Figure 4.1: ASIPs in the context of Processor HW implementation classes



Each PA contains a data path, which comprises the functional units 
and storage elements for data processing. The architecture’s remaining 
parts constitute the control units, which control the data path like the 
instruction decoder or the interrupt control logic. This distinction is not 
applicable to units that exhibit data- as well as instruction-dependent 
behavior, such as the branch prediction unit. 

A pipelined processor uses pipeline registers to subdivide
computational tasks into a sequence of overlapping, subsequently performed
subtasks. Each subtask is executed using the combinational resources of
a so-called pipeline stage. Dependencies between subsequent
instructions and resource conflicts result in pipeline hazards. Pipeline hazards
are due to the dependence between instructions that are close enough so
that the overlapping execution in the pipeline leads to a different access
sequence of resources or data than in the case of non-overlapping
execution. Hennessy and Patterson [107] classify these pipeline hazards as
follows:

• data hazards: data dependencies between two instructions which
would result in an incorrect behavior, if not properly resolved

- read after write: instruction (i+n) tries to read a data value
before instruction (i) writes it, which results in the wrong order
of write and read access (instruction (i+n) reads the outdated
value).

- write after write: instruction (i+n) writes a data value before
instruction (i), which results in the wrong order of the two
write accesses (the earlier instruction wins this race).

- write after read: instruction (i+n) writes a data value before
instruction (i) reads it, which also results in the wrong order
of read and write accesses (instruction (i) reads the incorrect
new data value).

• control hazards: these are due to branch instructions resulting in a
non-sequential program flow, which has to be taken care of by
inserting pipeline bubbles (insertion of “no operation” into a stage)
or by using the so-called “branch delay” slot(s). Branch delay slots
are instructions after the actual branch instruction that are always
executed, regardless of whether the branch was taken or not.

• structural hazards: these are due to resource conflicts of functional
units or of the memory and typically result in pipeline bubbles as well.
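
The three kinds of data hazards listed above can be derived from the
register read and write sets of two instructions. The representation
used in the following sketch (dictionaries with "reads" and "writes"
sets) and the function name are assumptions made for illustration. The
function reports the dependencies between an earlier instruction (i)
and a later instruction (i+n); whether a dependency actually becomes a
hazard depends on the pipeline organization.

```python
def data_hazards(earlier, later):
    """Classify the potential data hazards between instruction (i) and (i+n)."""
    hazards = set()
    if earlier["writes"] & later["reads"]:
        hazards.add("RAW")   # read after write
    if earlier["writes"] & later["writes"]:
        hazards.add("WAW")   # write after write
    if earlier["reads"] & later["writes"]:
        hazards.add("WAR")   # write after read
    return hazards


# add r1, r2, r3  followed by  sub r4, r1, r5
add = {"reads": {"r2", "r3"}, "writes": {"r1"}}
sub = {"reads": {"r1", "r5"}, "writes": {"r4"}}
```

For the two instructions above, `data_hazards(add, sub)` reports a read
after write dependency on register r1.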



4.2 ASIP Applications 

Typical applications of ASIPs can be found in the classical domains
where traditional µ-controllers and programmable digital signal
processors (DSPs) in combination with dedicated hardware are used.
In the last few years, a trend towards multi-threaded network processor
(NP) architectures optimized for network routing and switching
applications can be observed.

Application classes for ASIPs can be subdivided into 

• control-dominated systems which react to (typically non- 
periodical) external events often with real-time constraints on the 
response time 

• data-dominated systems where complex transformations of data are
performed using

- cyclostationary processing of data streams (operation sequence
is largely defined at compile time)

- non-cyclostationary processing (operation sequence is
strongly data dependent)

• a mixture of control- and data-dominated systems 

Examples for control-dominated systems are the above-mentioned net- 
work processors whereas typical cyclostationary processing of data 
streams can be found in many digital processing algorithms e.g. for 
filtering and equalizing data or for channel decoding [48]. Non- 
cyclostationary data processing is typically also a part of digital signal 
processing systems and can be found e.g. in digital receiver structures 
[179] that make use of different channel acquisition and tracking algo- 
rithms. 

From an ASIP-centric point of view, the historical development of
traditional fixed DSPs can be regarded as the continuous attempt to find the
optimum fit between the feasible hardware effort and the cost of a DSP
on one hand, and the demands of quickly changing, popular applications
on the other hand. This slowly developing process of DSP evolution has
produced ASIP-like features in general purpose DSPs, e.g.

• single cycle multiply-accumulate using the data bus and the pro- 
gram bus as sources for the multiplier (TMS320C2x [244]) 

• bit-reverse addressing mode e.g. for FFT-butterfly addressing
(TMS320C2x and C54x [245] and many others)

• subword parallelism (corresponding to a SIMD extension) using
two 16 bit numbers within a 32 bit word in order to perform 2
multiply-accumulate operations in one cycle (Lucent 16000 [27]
and others)

• computation of a parallelized 2-unfolded FIR or IIR using a delay
register (Lode DSP [28])

• Viterbi extension for Add-Compare-Select in combination with 
dedicated storage for the survivor path (TMS320C54x [245]) 

• software pipelined Viterbi execution using two specialized Viterbi
instructions (StarCore [186])

• further SIMD extensions for filtering purposes (cf. TMS320C62x,
TMS320C67x [246] and TigerSharc [10])
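
As an illustration of one of the features listed above, the address
sequence generated by a bit-reverse addressing mode can be modelled in
software. The function names are hypothetical and the sketch does not
describe the address generation unit of any particular DSP.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversed_order(n):
    """Return the bit-reversed access order for an n-point FFT (n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    bits = n.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(n)]
```

For an 8-point FFT the resulting order is 0, 4, 2, 6, 1, 5, 3, 7. A
dedicated addressing mode provides this sequence without spending
instructions on the index computation.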

For a limited number of algorithms, e.g. FIR/IIR filters, FFT, distance
calculations or even matrix operations, it is obviously possible to
optimize a fixed DSP instruction set architecture prior to fabrication.
However, if quickly evolving applications call for significantly different
algorithms, these “optimized” DSPs might exhibit poor performance.
In the worst case, an application might need an optimum
implementation for a mixed control- and data-dominated task, which calls for
a mixed implementation using features of µ-controllers together with
application-specific DSP features like in [189]. In such a case, a
reasonably designed ASIP that is solely optimized for the underlying
application will certainly outperform available fixed DSPs and
µ-controllers.

In Table 4.1 some commercial and academic ASIP case studies are 
listed as examples for typical ASIP applications. However, ASIP de- 
sign has been common in the industry for a longer period of time in the 
form of in-house DSPs, which are intended for a specific application 
domain [201]. 



4.3 ASIP Design Space 

The following classification of processors focuses on architectural fea- 
tures that are relevant for the implementation of ASIP processor archi- 
tectures. 

Flynn’s classification [82] is the most popular and lucid processor clas- 
sification based upon the number of instruction and data streams that 
can be simultaneously processed. The processor categories are: 

• SISD (Single Instruction, Single Data), which is the classical
definition of a scalar uniprocessor.

• SIMD (Single Instruction, Multiple Data), which defines the class
of vector/array processors.

• MISD (Multiple Instruction, Single Data) is often considered
irrelevant in practice. Nevertheless, instruction level parallel
architectures (like VLIW or superscalar architectures) with a non-parallel
load/store unit and a single I/O port are part of this class.

• MIMD (Multiple Instructions, Multiple Data), which covers
the range of many instruction level processors and
multi-processor/computer systems.

Application        Authors, Affiliation          Design
                   and Reference                 Environment
-----------------  ----------------------------  ----------------
MPEG-I             P. Ploger, J. Wildberg        CASTLE
decoding           GMD [207]

MPEG-II enc.,      S. Balakrishnan et al.        SYMPHONY
LMS Adaptive       Univ. of Twente [21]
Filtering

UNIX "crypt"       V. Zivkovic et al. [283]      MOVE

Java Processor     Sergio Akira Ito et al.       -
                   UFRGS - Brazil [125]

ATM cell           S. Virtanen et al.            TACO
processing         TUCS Finland [265]

Vector             M. Gschwind                   -
Processing         IBM Research Center [101]

MD5 encryption,    P. Faraboschi et al.          Lx platform
SHA                HP Lab. and STM
                   Cambridge (MA) [77]

JPEG2000           D. Chuang                     Improv Design
among others       Improv Systems Inc. [51]      Platform

FIR, JPEG,         R. E. Gonzales                XTENSA Proc.
Viterbi, Motion    Tensilica Inc. [97]           Design Platform
Detection, DES

RISC+DSP           ARC Cores Ltd. [12] [13]      ARC ARCtangent-A4DSP
applications

Table 4.1: ASIP Case Studies



Flynn’s classification is a good starting point for the following design
space definition to differentiate between the non-parallel SISD
architectures and the parallel SIMD and MIMD architectures. In Figure 4.2 a
classification of parallel architectures (similar to [227] with minor
modifications) is depicted.

Figure 4.2: Design Space of Parallel Architectures (Sima [227])



According to this classification, typical DSP applications like FIR
filtering, vector or matrix computations are a good match for data-level
parallel architectures. In fact, efficient hardware implementations
of these algorithms use dedicated hardware structures, which resemble
the data paths of these instruction set architecture classes (mostly
with further application-specific optimization like in [48]). The SIMD
principle is also used by some commercially available DSPs (e.g. the
C6X DSP from Texas Instruments [246] or the Trimedia Processor from
Philips [205]), which implement SIMD instructions to support multiple
parallel operations on register subwords. For high level languages the
compiler has to “vectorize” the code in order to target these
architectures efficiently. This vectorization is difficult for high
level languages like C and C++, which lack explicit support for vector
and matrix operations. This is one reason why VLIW architectures, which
avoid this issue, have become popular both for general purpose (e.g.
Intel’s EPIC architecture [157]) and digital signal processing
applications. A good overview of this topic is given in [122].
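The idea of SIMD operations on register subwords can be illustrated with a small Python sketch. It is not taken from any of the cited DSPs: the function name `packed_add16`, the 32-bit register width and the wrap-around (non-saturating) lane arithmetic are assumptions chosen for illustration only.

```python
def packed_add16(a: int, b: int) -> int:
    """Add two 32-bit words as two independent 16-bit lanes.

    Hypothetical model of a subword SIMD add: a carry out of the low
    lane is discarded instead of propagating into the high lane.
    """
    lane_mask = 0xFFFF
    lo = ((a & lane_mask) + (b & lane_mask)) & lane_mask
    hi = (((a >> 16) & lane_mask) + ((b >> 16) & lane_mask)) & lane_mask
    return (hi << 16) | lo


if __name__ == "__main__":
    # The low lane overflows (0xFFFF + 1) and wraps to 0 without
    # disturbing the high lane (1 + 1 = 2).
    print(hex(packed_add16(0x0001FFFF, 0x00010001)))
```

A compiler that targets such an instruction has to prove that two adjacent scalar additions are independent before it can merge them into one packed operation, which is the vectorization problem mentioned above.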

On the other hand, superscalar processors tend to have significantly
more complex hardware, which is needed to exploit instruction level
parallelism during program runtime. This extra hardware also requires
significantly more silicon area and energy, which is prohibitive for
energy-critical embedded digital signal processing applications.

Multi-threaded processors are used in particular in the area of network
processors. Multi-threading can be generally applied to utilize the
functional units of a processor more efficiently. This concept is
typically beneficial to hide memory latencies in order to increase the
processor’s throughput without affecting the computational latency of a
single thread. For time-critical tasks with regular data access
patterns, however, conventional DSPs with optimized memory organization
are often more suitable.

Process-level parallel architectures and systems are common for em- 
bedded systems in order to balance the workload of one processor. 
Typically, a combination of distributed memory with shared memory 
or dedicated inter-processor communication resources is used to avoid 
communication bottlenecks for number crunching algorithms [213]. 

The taxonomy in Figure 4.2 still lacks many important architectural de- 
tails, which are of practical relevance for ASIPs. The following sub- 
sections classify the ASIP design space with respect to important archi- 
tectural features. Each subsection describes one group of related, but 
orthogonal design parameters. 



4.3.1 Functional Units 

The functional units represent the data path elements of a processor.
The following characteristics and parameters can be identified for the 
functional units: 



• granularity: bit serial, word serial or word parallel operation 

• word width, number of parallel words etc. 

• arithmetic: fixed point, block floating point or floating point 

• operation(s): e.g. integer arithmetic, complex arithmetic, boolean
operations, Galois field operations, etc.

• configurability: fixed operation (e.g. signed multiplier) or
configurable operation (signed and unsigned multiplier)

• single vs. multi-cycle operation 

These characteristics are sufficient to span the design space for the be- 
havior of the functional units. Further aspects like control and pipelin- 
ing of these units are covered later on. 



4.3.2 Storage elements 

Storage elements in a processor system (including data and instruction 
memories) are used to temporarily store data and control information. 
The following list of characteristics determines the organization of stor- 
age elements for an ASIP: 

• word width and number of addressable words in the storage ele- 
ment 

• register organization: orthogonal register file, split register files or 
distributed registers 

• location of memory: on-chip memory or external memory 

• access time of memory: number of processor cycles to read/write 
data 

• memory organization: one memory for instructions and data (von 
Neumann architecture [192]) or different memories for instruc- 
tions and data (Harvard architecture) [55] 

• memory hierarchy: flat instruction memory or hierarchical organi- 
zation (using caches) 

• data memory organization: 

- single or dual ported memories 

- single memory bank or several data memory banks for simul- 
taneous access 

• instruction memory parallelism: sequential read of instructions or
parallel fetch of several instructions

• instruction memory type: synthesized or hard-macro ROM,
boot-loadable RAM or a combination of ROM and RAM

This classification includes the well-known register-register architec- 
tures (which use a data register file) as well as the register-memory and 
the memory-memory architectures (typically with distributed internal 
data registers). 

An orthogonal aspect of storage elements is how the processor accesses
them. For registers that are connected to just one functional unit,
this access is straightforward, because it is determined by the
dedicated connection of the register. General purpose register files
offer a limited number of read and write ports and are often connected
to data/address buses or multiplexer structures. Data memory accesses
are typically controlled by special load/store units. Depending on the
data memory organization, one or several simultaneous read/write
operations can be performed. Accesses to the same memory bank have to be
restricted by the load/store unit to just one access (two accesses) per
cycle for single (dual) port memories.
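The port restriction per memory bank can be sketched as a simple cycle count. The following Python model is an illustration only: the function name `access_cycles` is hypothetical, and it assumes that a load/store unit resolves conflicts by serializing the excess requests to a bank, which is one possible policy and not necessarily the one used in a given ASIP.

```python
from collections import Counter


def access_cycles(bank_ids, ports_per_bank):
    """Cycles needed to serve a group of simultaneous memory requests.

    bank_ids:       bank index targeted by each request (assumed input)
    ports_per_bank: 1 for single port, 2 for dual port memories
    Assumption: requests to different banks proceed in parallel, while
    requests to the same bank are serialized ports_per_bank at a time.
    """
    if ports_per_bank < 1:
        raise ValueError("a memory bank needs at least one port")
    requests_per_bank = Counter(bank_ids)
    if not requests_per_bank:
        return 0
    worst = max(requests_per_bank.values())
    return -(-worst // ports_per_bank)  # ceiling division


if __name__ == "__main__":
    print(access_cycles([0, 0, 1], ports_per_bank=1))  # two requests hit bank 0
    print(access_cycles([0, 0, 1], ports_per_bank=2))
```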



4.3.3 Pipelining 

The concept of pipelining in a processor can be applied to single
combinational functional units or to subdivide groups of functional
units into different stages for instruction execution. The concept of
pipelining is not orthogonal to the organization of storage elements in
the previous subsection, because it introduces additional storage
elements to the architecture. The purpose of the additional pipeline
registers is not primarily to store data but rather to increase the
computational performance (sometimes also to increase the
energy-efficiency like in Section 7.1).

Pipelining of single combinational functional units increases the
maximum clock frequency of this unit and thus the possible computational
throughput. This is especially useful if the same computation has to be
performed for a series of input data. Pipelining is also a technique to
utilize functional units more efficiently, because a computation is
partitioned into subcomputations that are executed in parallel for a
series of input data, in analogy to an industrial assembly line.

Pipelining can also be used at a coarser level of abstraction to
separate different groups of functional or control units from each
other. A typical pipeline organization of a RISC processor uses the
pipeline stages instruction fetch, instruction decode, read operand,
execute and write-back operand (cf. Figure 4.3). Pipelining enables
higher operating frequencies. On the other hand, data and resource
dependencies between different stages lead to pipeline hazards, which
effectively reduce the utilization of the pipeline stages. For a more
detailed discussion of pipeline hazards refer to Section 4.1 and [107].

Figure 4.3: Example RISC Processor Pipeline



The total processing time $T_{pipe}$ to process $n$ instructions with a
linear pipeline of $s$ stages is

$$T_{pipe} = (n + s - 1)\,T_{clk} \qquad (4.1)$$

for a clock period of $T_{clk}$. In the following ideal consideration,
pipelining overhead due to setup times of real flip-flops is neglected.
In the limiting case of identical critical timing paths $T_{clk}$ of
each pipeline stage the equivalent unpipelined architecture needs

$$T_{unpipe} = n\,s\,T_{clk} \qquad (4.2)$$

for the same computation.

As a result the speedup factor of pipelining is

$$\frac{T_{unpipe}}{T_{pipe}} = \frac{n\,s}{n + s - 1} \qquad (4.3)$$

If the additional area for the pipeline registers is neglected,
pipelining leads to an increased architectural efficiency (cf. Section
3.1.1):

$$\eta_{arch,pipe} > \eta_{arch,nopipe} \qquad (4.4)$$
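Equations (4.1) to (4.3) can be evaluated numerically with a few lines of Python. The function names are chosen for this sketch only, and the model shares the idealization of the text: identical stage delays and no flip-flop overhead.

```python
def t_pipe(n: int, s: int, t_clk: float) -> float:
    """Processing time of n instructions on an s-stage pipeline, Eq. (4.1)."""
    return (n + s - 1) * t_clk


def t_unpipe(n: int, s: int, t_clk: float) -> float:
    """Processing time of the equivalent unpipelined architecture, Eq. (4.2)."""
    return n * s * t_clk


def pipeline_speedup(n: int, s: int) -> float:
    """Ideal speedup factor of pipelining, Eq. (4.3); independent of t_clk."""
    return (n * s) / (n + s - 1)


if __name__ == "__main__":
    for n in (1, 10, 100, 10_000):
        print(n, round(pipeline_speedup(n, s=5), 3))
```

For a single instruction the speedup is 1, because the pipeline first has to be filled. For long instruction sequences the speedup approaches the number of stages $s$.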

4.3.4 Interconnection Structure 

The interconnection between functional units and storage elements
determines the flexibility of a data path, which is the most important
distinguishing feature between more dedicated and general purpose data
paths.

Basically, there are two options for the interconnection between two
nodes: unidirectional or bidirectional interconnection. A unidirectional
interconnection is implemented using a simple wire between the output
of the producing node and the input of the consuming node. Bidirectional
interconnections are more complicated, because the designer has to make
sure that the required bidirectional drivers are not simultaneously
driving the bus with different logic values, which would result in short
circuits. This problem can be avoided by using separate input and output
ports for each node together with separate unidirectional
interconnections between these ports. However, this approach needs more
silicon area, which would be a significant drawback for system buses.

There is a large variety of possible interconnection networks, e.g.
using binary trees, stars, meshes or systolic arrays [120]. However,
all these interconnection networks can be constructed using two
fundamental topologies:

• one output producing information for one or several inputs

• one or several outputs producing information for one input



Figure 4.4 depicts the hardware implementation of these two options.
The left implementation does not require any additional combinational
hardware, whereas the right implementation needs a multiplexer or
tristate output drivers (with additional control units). The overhead
due to interconnections (especially in the case of non-tristate buses)
can be significant for a highly configurable target architecture,
because in deep submicron technologies the silicon area and delay of
interconnections are large relative to those of combinational logic
[19]. For that reason, the interconnection network should be carefully
dimensioned, preferring local over costly global communication and
minimizing the interconnection flexibility as far as possible for the
target application.

Figure 4.4: Basic Network Topologies



4.3.5 Control Mechanisms 

There are two different mechanisms to control the data path of an
instruction set oriented processor:

• time-stationary coding: the instruction controls exactly one state
transition of the complete data path

• data-stationary coding: the instruction travels together with the
associated data in the pipeline and controls the sequence of
operation(s) performed on these data in each pipeline stage

As reported in [98], many ASIPs use time-stationary coding, because
it facilitates the programming and verification of these architectures.
However, for more deeply pipelined architectures pure time-stationary
coding is inefficient due to the large number of redundant configuration
bits needed for the instruction. For these architectures data-stationary
coding is more suitable (cf. [91] for an example).

The design of pipelined data paths using data-stationary coding requires 
the following design decisions: 

• open pipeline: The pipeline is fully visible to the programmer and 
the programmer has to take care in order to avoid structural and 
data hazards (which both would lead to incorrect program behav- 
ior). The same is valid for control hazards: the programmer has 
to fill the delay slot(s) with valid instructions after each control 
instruction. 

• interlocked pipeline: The pipeline is not visible to the programmer, 
because the hardware takes care of structural, data and control haz- 
ards by using 

- pipeline interlocking (stalling of previous pipeline stages) in 
order to resolve these dependencies 

- forwarding and register/memory bypassing to avoid stall cy- 
cles by smarter data routing 

For the processing of program loops, special hardware support for zero- 
overhead loop processing can be implemented. This hardware replaces 
the instructions at the end of the loop (increment/decrement of loop 
counter, compare with end value, branch on this condition) and avoids 
the associated branch penalty. This kind of hardware loop support has 
been implemented in many DSPs and ASIPs e.g. [186] [97] and [89]. 
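The benefit of zero-overhead loop hardware can be estimated with a simple cycle count model. The Python sketch below rests on assumptions that are not part of the text: every instruction takes one cycle, the software loop needs the three overhead instructions named above (counter update, compare, branch), the branch is taken on all but the last iteration, and the hardware loop needs a fixed number of setup cycles. All function and parameter names are hypothetical.

```python
def cycles_software_loop(iterations, body_cycles, branch_penalty,
                         overhead_instr=3):
    """Cycle count of a loop with explicit loop control instructions."""
    taken_branches = max(iterations - 1, 0)  # last iteration falls through
    return (iterations * (body_cycles + overhead_instr)
            + taken_branches * branch_penalty)


def cycles_hardware_loop(iterations, body_cycles, setup_cycles=1):
    """Cycle count with zero-overhead loop hardware (only setup remains)."""
    return setup_cycles + iterations * body_cycles


if __name__ == "__main__":
    sw = cycles_software_loop(iterations=100, body_cycles=4, branch_penalty=2)
    hw = cycles_hardware_loop(iterations=100, body_cycles=4)
    print(sw, hw, round(sw / hw, 2))
```

The model suggests that the gain is largest for short loop bodies, which are typical for DSP kernels.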

In the last few years, conditional instruction execution has become
popular for deeply pipelined processor architectures. Conditional or
predicated execution means that the execution depends on a special
condition or predicate register. This condition/predicate bit can be set
by e.g. a comparison instruction to implement a HLL statement like
“if (...) then ... else ...” without using conditional branches.
Consequently, control hazards are avoided with this approach.
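Predicated execution can be made concrete with a toy interpreter. The following Python sketch is a simplified model and not the instruction set of any real processor: the instruction format `(guard, destination, operation)` and the function `run_predicated` are assumptions for illustration. Every instruction of the straight-line program is issued, but its result is only written back if its guard predicate holds.

```python
def run_predicated(program, regs):
    """Execute a branch-free program with guarded write-back.

    program: list of (guard, dest, op) tuples, where guard is None
             (unconditional) or a (predicate_register, required_value)
             pair, and op computes the result from the register dict.
    """
    for guard, dest, op in program:
        value = op(regs)  # the operation is always issued
        if guard is None or regs[guard[0]] == guard[1]:
            regs[dest] = value  # write-back only if the predicate matches
    return regs


# "if (a > b) then r = a - b else r = b - a" without any branch
ABS_DIFF = [
    (None,         "p", lambda r: r["a"] > r["b"]),  # compare sets predicate p
    (("p", True),  "r", lambda r: r["a"] - r["b"]),  # executed if p is set
    (("p", False), "r", lambda r: r["b"] - r["a"]),  # executed if p is clear
]

if __name__ == "__main__":
    print(run_predicated(ABS_DIFF, {"a": 3, "b": 7})["r"])
    print(run_predicated(ABS_DIFF, {"a": 9, "b": 2})["r"])
```

The price of avoiding the branch is that both alternatives occupy issue slots, so predication pays off mainly for short conditional blocks.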

Residual control of functional units is sometimes applied to configure 
e.g. the saturation/overflow mode of an ALU. This mechanism uses a 
dedicated control register and is especially beneficial, if changes of this 
residual configuration (which can be modified by a processor instruc- 
tion) are rare. 

Distributed or centralized control can be used for processors with dis- 
tributed functional units and split registers or for multi-processor sys- 
tems. As typical ASIPs mostly use simple structures with local func- 
tional units, a more thorough investigation of distributed architectures 
clearly is beyond the scope of this thesis. 



4.3.6 Storage Access 

The access methods for the storage elements of a processor can be
subdivided into register access and memory access.

Register access for dedicated registers that are connected to just one
functional unit is simply controlled by the instruction type (e.g. ALU
or multiply instruction). Access to a data register file with multiple
internal registers and with multiple read/write ports typically has to
be controlled by special operand fields (“register” fields) of a
processor instruction. The input data for a register write operation are
typically produced by a functional unit, read from memory or extracted
from an immediate field of the instruction.

Memory accesses often use more elaborate addressing schemes:

• direct or absolute addressing: the address is directly extracted from
the “direct address” field of the instruction

• indirect or register deferred addressing: the address is taken from
a (data or address) register

• indirect with displacement: same as indirect but with an additional
displacement extracted from the instruction

• indexed addressing: the address is calculated using two (data or
address) registers (often the effective address is just the sum of the
two register values)
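The four addressing schemes listed above differ only in how the effective address is formed. A minimal Python sketch makes this explicit. The mode names, the dictionary-based instruction fields (`address`, `base`, `disp`, `index`) and the register names are hypothetical and serve only as an illustration.

```python
def effective_address(mode, regs, instr):
    """Compute the effective memory address for one addressing mode.

    regs:  dict mapping register names to their current values
    instr: dict holding the operand fields of the (hypothetical) instruction
    """
    if mode == "direct":          # address field of the instruction
        return instr["address"]
    if mode == "indirect":        # address taken from a register
        return regs[instr["base"]]
    if mode == "indirect_disp":   # register plus displacement field
        return regs[instr["base"]] + instr["disp"]
    if mode == "indexed":         # sum of two registers
        return regs[instr["base"]] + regs[instr["index"]]
    raise ValueError(f"unknown addressing mode: {mode}")


if __name__ == "__main__":
    regs = {"a0": 0x100, "r1": 4}
    print(hex(effective_address("direct", regs, {"address": 0x20})))
    print(hex(effective_address("indirect", regs, {"base": "a0"})))
    print(hex(effective_address("indirect_disp", regs, {"base": "a0", "disp": 8})))
    print(hex(effective_address("indexed", regs, {"base": "a0", "index": "r1"})))
```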

Furthermore, additional so-called “post-operations” are often associated
with the above-mentioned addressing modes. The simplest post-operation
is post-increment/decrement, which increments/decrements the associated
address register by a constant in order to make it point to the next
data address in memory. More sophisticated post-operations include the
popular addition with reversed carry chain propagation, which is used
for FFTs.
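Addition with reversed carry propagation can be modeled in software by mirroring the operands, adding them normally and mirroring the result back. The Python sketch below uses this equivalence. It illustrates the arithmetic only and does not describe how an address generation unit implements it in hardware. The function names are chosen for this example.

```python
def bit_reverse(x: int, bits: int) -> int:
    """Mirror the lowest `bits` bits of x."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (x & 1)
        x >>= 1
    return result


def reverse_carry_add(addr: int, step: int, bits: int) -> int:
    """Add step to addr with the carry propagating from MSB to LSB."""
    mask = (1 << bits) - 1
    mirrored_sum = (bit_reverse(addr, bits) + bit_reverse(step, bits)) & mask
    return bit_reverse(mirrored_sum, bits)


def bit_reversed_sequence(n_points: int):
    """Address sequence for an FFT of n_points (a power of two)."""
    bits = n_points.bit_length() - 1
    addr, sequence = 0, []
    for _ in range(n_points):
        sequence.append(addr)
        addr = reverse_carry_add(addr, n_points // 2, bits)  # post-operation
    return sequence


if __name__ == "__main__":
    print(bit_reversed_sequence(8))
```

Using half the FFT length as the increment yields the bit-reversed index order, which is why this post-operation removes the need for a separate reordering pass.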

4.3.7 Instruction Coding and Instruction Fetch Mechanisms 

Instruction coding determines two important aspects of an ASIP
implementation: the program memory size* and the implementation
flexibility. A decrease of the instruction width obviously reduces the
instruction memory width, but it also reduces the flexibility of the
encoded instructions. For instance, a RISC instruction format with three
register operand fields enables operations like $(R3 = R1 + R2)_{instr1}$
using just one instruction. A two operand instruction format has to
use two instructions for the same operation: $(R3 = R2)_{instr1}$;
$(R3 = R3 + R1)_{instr2}$. Even for this simple example, the effect on
overall instruction memory size depends on the application program. This
fact is exploited by application-specific processors, where the
processor designer can optimize the instruction encoding within the
following two bounds (which represent extremes w.r.t. instruction width
and flexibility):



• micro-coded instructions, which offer the highest possible
flexibility and need the widest instruction memory (an elaborate
instruction decoder is unnecessary in this case)

• application-specific compressed instruction encoding obtained by
enumeration and binary coding of all different instructions in a
given program (this heavily restricts the flexibility of the supported
instructions to the set of instructions for which the encoding
has been performed, but yields the minimum possible instruction
width)

*The size of the program memory also has a considerable impact on energy consumption.

For many practically relevant cases, an instruction coding that encodes 
the available operations using a fixed length instruction field is used. 
Furthermore, the operation’s operands like register or memory operands 
typically have to be programmable, thus, requiring associated operand 
fields in the instruction. For less orthogonal instruction set architectures 
these operand fields can be partially omitted using partially hard coded 
operands for one or more instructions. This reduces the memory foot- 
print, which can be exploited by ASIPs [91], provided that the decrease 
in flexibility of the ISA is acceptable. 

Typically, it is a challenging task for a given application to find the 
optimum instruction coding that represents a feasible tradeoff between 
flexibility and code size. 
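The two bounds of the instruction width named earlier in this subsection can be computed for a toy example. The Python sketch below is an illustration under assumptions: the control bit counts per unit and the instruction strings are invented, and the functions `microcoded_width` and `compressed_width` are not part of any tool mentioned in this book.

```python
def microcoded_width(control_bits_per_unit):
    """Upper bound: one control field per unit, no instruction decoding."""
    return sum(control_bits_per_unit.values())


def compressed_width(program):
    """Lower bound: enumerate the distinct instructions of one program
    and assign each a binary code of minimum length."""
    distinct = len(set(program))
    return max(1, (distinct - 1).bit_length())  # ceil(log2(distinct))


if __name__ == "__main__":
    # Hypothetical data path with three units and their control bits
    units = {"alu": 6, "multiplier": 3, "load_store": 10}
    # Hypothetical program: six instructions, five of them distinct
    program = ["add r1,r2,r3", "ld r1,(a0)", "add r1,r2,r3",
               "mac r4,r1,r2", "st r4,(a1)", "bnz loop"]
    print(microcoded_width(units), compressed_width(program))
```

Any practical encoding lies between these two numbers. The compressed bound changes as soon as the program uses an instruction that was not part of the enumeration, which is the loss of flexibility described above.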

The instruction coding also has an impact on the memory organization 
and on the instruction fetch stage. Instruction fetch mechanisms de- 
scribe the way instructions are routed from the instruction memory to 
the instruction decoder. For scalar architectures with a single instruc- 
tion fetch per cycle this mechanism is trivial. However, for VLIW or 
superscalar architectures with parallel instruction fetch, an efficient and 
more complex fetch mechanism is essential to keep the parallel data 
path busy. 

Basically, there are two popular, commercially used coding schemes for 
VLIW processors, which impact the instruction fetch stage: 

• uncompressed VLIW encoding 

• various compressed encoding schemes 

Figure 4.5 depicts the principle of these two different schemes. The
uncompressed VLIW encoding typically uses one bit field for each
functional unit of the data path, which controls the operation of that
unit. This results in a significant waste of instruction bits for the
case of non-parallelizable instruction execution, where unused bit
fields have to be explicitly filled with horizontal “no operation”
patterns [76]. The example for a compressed VLIW encoding in Figure 4.5
is similar to the scheme in [153] or [246], where the “P”-bit in each
instruction is used to indicate that the following instruction can be
issued in parallel. The disadvantage of compressed VLIW schemes is the
additional hardware effort to decompress the instructions, to allocate
the associated processor resources (if a specific resource is not
defined by the instruction, e.g. in the case of identical, replicated
functional units) and to dispatch the instructions to the desired
location. This decompression step can conceptually be seen as a mapping
of the compressed instruction stream to a normal uncompressed VLIW
representation as depicted in Figure 4.5.
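The mapping from a compressed instruction stream to uncompressed VLIW words can be sketched in Python. The model below is simplified: it assumes that each instruction already names its target unit, so that no resource allocation is needed, and it uses the four fields of Figure 4.5 under the hypothetical names `LSU`, `ALU1`, `ALU2` and `MUL`. The opcodes in the example are invented.

```python
UNITS = ("LSU", "ALU1", "ALU2", "MUL")  # field order of the VLIW word


def group_execute_packets(stream):
    """Split a stream of (unit, opcode, p_bit) into parallel packets.

    A set P-bit means that the following instruction is issued in
    parallel with the current one; a cleared P-bit closes the packet.
    """
    packets, current = [], []
    for unit, opcode, p_bit in stream:
        current.append((unit, opcode))
        if not p_bit:
            packets.append(current)
            current = []
    if current:  # stream ended with a set P-bit
        packets.append(current)
    return packets


def expand_to_vliw(packet):
    """Map one packet to an uncompressed VLIW word, filling gaps with NOPs."""
    word = {unit: "NOP" for unit in UNITS}
    for unit, opcode in packet:
        if word[unit] != "NOP":
            raise ValueError(f"unit {unit} used twice in one packet")
        word[unit] = opcode
    return [word[unit] for unit in UNITS]


if __name__ == "__main__":
    stream = [("LSU", "LD", 1), ("ALU1", "ADD", 1), ("MUL", "MPY", 0),
              ("ALU2", "SUB", 0)]
    for packet in group_execute_packets(stream):
        print(expand_to_vliw(packet))
```

In this example four stored instructions expand into two VLIW words with eight fields, four of which are NOPs that never had to be stored in the instruction memory.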

More elaborate compression schemes use compile time compression,
which reduces the code redundancy by using statistical methods [164]
[278] [139]. These schemes require runtime decompression by hardware or
software, resulting in a potential performance degradation. Furthermore,
the effort for architecture design and verification might increase
significantly, because runtime decompression introduces several issues,
e.g. more complicated branch processing, which results in a
non-orthogonal architecture.

For embedded applications like the DVB-T receiver of [88] the instruc- 
tion memory resides on-chip as a ROM. For field reprogrammable ap- 
plications, however, the instruction memory is either implemented as an 
on-chip RAM or external memory is used. One constraint of the cod- 
ing width in case of external memories is the bit width, which is often 
restricted to a multiple of 8 or 16 bits for off-the-shelf external memory 
elements. 



4.3.8 Interface Mechanisms 

Input and output (I/O) mechanisms for data and control information
affect both the ASIP hardware and the ASIP software. Communication
can be performed between the ASIP and other on- or off-chip devices like
processors, dedicated hardware or analog components (e.g. AD/DA
converters). The following taxonomy describes the interface mechanisms
from the ASIP perspective: ASIP-external implementation characteristics
of interfaces, which use e.g. dedicated connections or shared system
buses, are not covered.

Uncompressed VLIW Format

| Load/Store Field | ALU1 Field | ALU2 Field | Multiplier Field |

Example for Compressed VLIW Format

| Instruction 1 |P| Instruction 2 |P| Instruction 3 |P| ... | Instruction N |P|
        |
        v
Instruction Decompressor, Resource Allocator and X-Bar Switch
        |
        v
| Load/Store Field | ALU1 Field | ALU2 Field | Multiplier Field |
internal VLIW format

Figure 4.5: VLIW instruction formats

Depending on characteristics like the data rate and the number of
transferred data or control samples per iteration of an algorithm (in
the case of cyclostationary data processing) different I/O mechanisms
can be used:



• Memory-based I/O: Data is exchanged with a shared memory.
This concept is typically suitable for a larger amount of data words
per iteration, which enables high data rates due to a low overhead
per sample.



• Register-based I/O: either dedicated (ASIP internally
read/writeable) registers or memory-mapped registers that can be
accessed similar to ordinary memory storage locations are used for
data transfers. This concept is typically suitable for a smaller
amount of data words per iteration, because of the large silicon
area consumption of semi-custom registers. The data rate in this
case is typically smaller than for the shared memory approach due
to a considerable overhead per sample, which is needed to
synchronize the data.



• Dedicated control channels, which typically affect the program 
flow for synchronization purposes (using e.g. synchronous reset 
signals, interrupt vectors, start-stop and/or resume-suspend signals 
for certain tasks including above-mentioned data transfers). 



These I/O mechanisms have to be supported by appropriate hardware:
e.g. in the case of shared memory, a memory arbiter has to be
implemented, whereas in the case of dedicated control channels, some
sort of direct memory access (DMA) controller functionality is needed
like in [30]. In the case of communication based on dedicated registers,
special instructions have to be implemented to access them. For
memory-mapped registers either a reserved I/O address space has to be
used in combination with conventional load/store instructions or,
alternatively, an orthogonal I/O address space together with additional
I/O instructions is needed. Finally, dedicated I/O ports like in
conventional ASIC hardware blocks have to be used e.g. for handshake,
start and stop signals. This interface mechanism can be supported by
special instructions and/or by some kind of program flow or program
interrupt controller.



4.3.9 Tightly-Coupled ASIP Accelerators

Accelerators are optimized dedicated hardware structures, which are
typically able to perform a very limited set of computational tasks.
Tightly-coupled ASIP accelerators can be viewed as elaborate functional
units in the ASIP, which are integrated in the instruction set
architecture. This integration is reflected by the interface and control
mechanisms that are typically used. The interface between the ASIP core
and the accelerator is realized either using specialized internal
registers or simply the general purpose register file. The control of
accelerators, on the other hand, can be performed by using specialized
ASIP instructions like in [15] and/or specialized registers similar to
residual functional control like in [180]. The difference between a
tightly-coupled ASIP accelerator and an ordinary functional unit in the
ASIP is that accelerators typically implement more complex sequences of
operations. This requires a more complex internal structure, often with
an ASIP-independent controller.

Accelerators are typically used, if at least one of the following condi- 
tions is fulfilled: 



• extremely high computational performance is required, which cannot
be achieved by modification of ordinary functional units



• extremely high energy-efficiency is needed 



In addition to the implementation of the already mentioned interface
and control mechanisms of ASIP accelerators, the designer has the same
degrees of freedom for their internal implementation as for dedicated
hardware blocks: The most important decision is based on the tradeoff
between additional area consumption (which corresponds to the degree of
parallelism in the accelerator implementation) and the additional
computational performance. However, there are further important
decisions that affect the overall implementation flexibility: Ideally,
ASIP accelerators should only be used for tasks with a very low
probability of late design changes. This strategy minimizes the risk of
a chip redesign.



4.4 Critical Factors for Energy-Efficient ASIPs

The question arises, which of the design axes of Section 4.3 are most
important in order to implement computationally optimized,
energy-efficient ASIPs. The issue behind that question is that there is
no single ASIP application that represents all possible applications in
the world*. This in turn makes it hard to deduce common properties and
propose common guidelines. Each ASIP application has its own
characteristics concerning typical operations, typical data transfer
schemes, data rates and additional constraints. Obviously, the right
question to ask is how the designer can find the critical parameters and
how to tune them in order to achieve a certain application-specific
design goal. These questions related to the design flow will be answered
in Chapter 5. Prior to tuning parameters it is important to understand
the principal effects of important design decisions. This is the focus
of the following subsections, starting with the typically most important
timing and performance constraints. Afterwards, the impact of ASIP
modifications on energy and area consumption is discussed.

*This issue is in analogy to “the Ultimate Question of Life, the Universe and Everything. All we know
about it is that the answer is 0b101010” [4].



4.4.1 Timing and Computational Performance

Many high level ASIP design approaches like [17] or [70] use the
abstraction of machine cycles as a metric to evaluate the result of a
design modification. This approach does not consider the impact on low
level timing (critical path $T_{crit}$) of synchronous logic using
edge-triggered flip-flops [67]. The critical path $T_{crit}$ of such a
circuit determines the maximum operating clock frequency
$f_{max} = 1/T_{crit}$.

The change of $T_{crit}$ can be significant in some cases, e.g. in
[101], where the dramatic effect of small, incremental modifications of
functional units, the control logic and the memory units on the maximum
clock frequency is evaluated. An increase of up to 30% in the case of
small changes in the ALU and an increase of up to 60% in the case of a
modification in the branch unit emphasize the importance of early low
level hardware estimations.

Two design approaches are possible:



• the critical path $T_{crit}$ is constrained by the ASIP system
environment (typical low-end embedded application scenario)

• the ASIP is running (nearly) at the maximum speed $f_{max}$ and the
minimization of the total runtime $T_{run,min} = T_{crit} N_{cyc}$ of a
task (which requires $N_{cyc}$ processor cycles) is the optimization
goal (high-end application scenario)



In the first case, the critical path of the ASIP is upper-bounded by the
clock period of the system. During ASIP design, it has to be guaranteed
that this constraint is not violated by any ASIP modification. This
means that after each major or minor hardware modification, the hardware
estimation design flow (refer to Section 5.3) has to be repeated in
order to check this constraint. This methodology is analogous to
conventional HDL-based hardware design, where automatic synthesis has to
be performed regularly after design modifications to check low level
constraints. In order to obtain a moderate design time while exploring a
sufficiently high number of different ASIP implementations, it is
mandatory to automate this hardware design flow to a large extent.

In the second case, incremental ASIP changes are performed with the
goal to minimize the total runtime $T_{run,min}$ of a task. This might
be useful in the case of programmable accelerator chips (e.g. for
high-end graphics applications like [58] or [106]), where high data
throughput is a competitive advantage. In order to achieve this goal,
the product $T_{crit} N_{cyc}$ has to be reduced. As previously
mentioned, even small changes to the ASIP architecture that reduce
$N_{cyc}$ can lead to a significant increase in $T_{crit}$. To make
matters worse, the reduction in $N_{cyc}$ is typically strongly
application-specific; thus, late design changes of the application might
lead to suboptimal performance. An example for such a worst case is the
scenario shown in [77], where the instruction set has been
(over-)optimized for MD5 (message digest) encryption, which was actually
harmful for a different algorithm (SHA, Secure Hash Algorithm [220]).
Such a worst case represents the bound of flexibility and the risk of
over-specialization of an ASIP implementation.





However, properly designed ASIPs typically take advantage of changes 
in the data path, without significantly affecting the critical path. This 
can be achieved by parallelization of computations using parallel func- 
tional units supported either by replicated decoders with an additional 
dispatcher like in Chapter 7.2 or by specialized instructions. The in- 
terconnection structure of a parallelized data path has to be designed 
carefully to avoid communication bottlenecks in large general purpose 
registers or large power and area consuming switch matrices like in [8 1 ] . 
On the other hand, approaches that emphasize the chaining of opera- 
tions [262] risk to increase the critical path. If the increase of Tcru can 
be tolerated (e.g. because it does not violate the clock constraint of the 
system environment), it has to be (over-) compensated by a correspond- 
ing decrease of Ncyc in order to achieve a benefit for the total execution 
time. However, if the increase of Tcru can not be tolerated, retiming 
of logic can be performed and/or additional pipeline registers/multi- 
cycle operations can be introduced. Retiming (cf. Subsection 3.3.2), 
which can be manually or automatically performed, has the goal to bal- 
ance the delays of combinational logic in different stages. Retiming is 
only possible, if the critical cyclic graph of logic contains at least two 
registers. Multi-cycle operations mixed with single cycle operations are 
obviously feasible, but they make the implementation less orthogonal 
and increase the verification effort. Introduction of additional pipeline 
stages also tends to increase the penalty for taken branch instructions 
and increases in turn Ncyc (refer to Section 7.1 for an example, where
this has happened). 
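
The trade-off between cycle count and clock period discussed above can be made concrete with a small calculation. The following is a minimal sketch, assuming that the total execution time is simply the product of the cycle count Ncyc and the clock period (written Tclk here); the function names and all numbers are hypothetical and only serve as an illustration.

```python
def execution_time(n_cyc: int, t_clk_ns: float) -> float:
    """Total execution time in ns: cycle count times clock period."""
    return n_cyc * t_clk_ns


def chaining_pays_off(n_cyc_old, t_clk_old_ns, n_cyc_new, t_clk_new_ns) -> bool:
    """True if the modified design is strictly faster than the original one."""
    new_time = execution_time(n_cyc_new, t_clk_new_ns)
    old_time = execution_time(n_cyc_old, t_clk_old_ns)
    return new_time < old_time


# Hypothetical numbers: operator chaining stretches the clock period
# from 10 ns to 11 ns and reduces the cycle count from 1000 to 800.
baseline = execution_time(1000, 10.0)
chained = execution_time(800, 11.0)
print(baseline, chained, chaining_pays_off(1000, 10.0, 800, 11.0))
```

With these assumed numbers the longer clock period is over-compensated by the saved cycles. If chaining only saved 50 cycles (950 instead of 1000), the same 10% increase of the clock period would make the design slower.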

If the above-mentioned approaches to ASIP performance optimization 
fail to meet the constraints of an application, the implementation of a 
tightly-coupled ASIP accelerator is an option. This implementation, 
however, corresponds to a shift of the ASIP implementation towards 
more dedicated hardware, which has to be carefully considered in order 
to avoid an unnecessary decrease in the overall implementation flexi- 
bility. In many cases it is possible to use the accelerator for a limited 
subset of a runtime-critical task (e.g. a loop body like in [222]) which 
actually does not require a significant amount of flexibility. 

If the application exposes a significant amount of data parallelism, it 
might be advantageous to implement parts of the ASIP as data parallel 
architecture. In this case, appropriately high memory bandwidth has to 
be provided in order to keep the functional units busy. One of the most
important advantages of processing elements and memories on a single
chip is the fact that memory bandwidth is (theoretically) only bounded
by the exploitable data parallelism of the application and the available 
silicon area (both for memories and functional units). For off-chip com- 
munication the chip pad limits are a major cost factor and obstacle to 
implement a high bandwidth interface. This fact naturally leads to a het- 
erogeneous, non-hierarchical, but partially parallel memory architecture 
with small, fast scratch pad memories for intermediate values and larger 
(and possibly slower) main memories. The use of a memory hierarchy 
with level one, level two and main memory would also be an option 
to increase the bandwidth to main memory. In the case of large off- 
chip external main memory, this concept is needed in order to decrease 
the memory latency of each access. For typical cyclostationary DSP 
kernels with a limited amount of required data storage, however, the 
memory access schemes are regular and easily predictable, thus, the in- 
troduction of cache memories optimized for irregular (general purpose) 
access patterns is typically overhead. This is one of the main differences 
between ASIPs and general purpose digital signal processors like TI's
TMS320C6x [246], which extensively uses such a cache hierarchy at 
the expense of a decreased energy-efficiency. 

Finally, the data I/O for high performance ASIPs is a critical factor, be- 
cause it can decrease the utilization of functional units, if the processor 
itself has to take care of it. One solution to this is a high speed DMA
controller with exclusive access to parts of the ASIP memory. An even 
more elaborate scheme is the combination of conventional direct mem- 
ory access controllers with a suitable double buffering scheme. Double 
buffering reserves two parts of the (shared) memory: one part is used 
by the DMA controller and the other part by the ASIP. After DMA and 
ASIP data processing has finished, a simple control logic exchanges the 
two parts of the memory in order to enable DMA access to the second 
part and ASIP access to the first part. 
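
The double buffering scheme described above can be sketched in software as follows. This is a behavioral illustration only: the class and function names are hypothetical, and the DMA transfer and the ASIP processing, which run concurrently in hardware, are modeled sequentially here.

```python
class DoubleBuffer:
    """Two memory halves: the DMA fills one while the ASIP processes the other."""

    def __init__(self, size: int):
        self.halves = [[0] * size, [0] * size]
        self.dma_half = 0  # index of the half currently owned by the DMA

    @property
    def asip_half(self) -> int:
        return 1 - self.dma_half

    def dma_write(self, block):
        self.halves[self.dma_half][:] = block

    def asip_read(self):
        return list(self.halves[self.asip_half])

    def swap(self):
        """Control logic: exchange ownership once both sides have finished."""
        self.dma_half = 1 - self.dma_half


def run(blocks, process):
    """Stream all blocks through the double buffer and collect the results."""
    buf = DoubleBuffer(len(blocks[0]))
    results = []
    buf.dma_write(blocks[0])            # prime the first half
    for nxt in blocks[1:] + [None]:
        buf.swap()                      # ASIP now owns the freshly filled half
        if nxt is not None:
            buf.dma_write(nxt)          # DMA fills the other half meanwhile
        results.append(process(buf.asip_read()))
    return results


print(run([[1, 2], [3, 4], [5, 6]], sum))
```

The ASIP never waits for a transfer as long as filling one half takes no longer than processing the other half.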



4.4.2 Energy Consumption 

The effects of architectural changes on the energy consumption for 
a given computational task are more complicated than the
above-mentioned effects on computational performance. This is mostly due
to the statistical nature of power consumption, which is affected by data 
correlation of subsequent binary values on the nodes. This fact requires 
detailed tool-supported power analysis for the relevant operation sce- 
narios. 

The following discussion uses the abstraction of word-level hardware 
operators, which is the natural level of abstraction for HDL-based
hardware design. In analogy to [54], where the term intrinsic
computational efficiency of silicon has been introduced, the following
terms are defined for simplification purposes:

• In case of a full match between application and architecture, each 
hardware operator (like e.g. an adder or multiplier) is contributing 
a useful calculation to the overall computational task. For a large 
set of stimuli, the average energy consumption of this ideal archi- 
tecture can be calculated, which shall be called Intrinsic Com- 
putational Energy Ei. In the case of an adder this energy is a 
function of the technology, the operating conditions (like supply 
voltage and temperature) and of the adder implementation (energy 
evaluations of different adder implementations can be found in e.g. 
[190] or [196]). 

• The difference between the overall energy Etot of a synchronous,
instruction set oriented processor (including all the memories that
are needed to process the task under consideration) and the
intrinsic computational energy Ei is called Overhead Energy Eovhd.
This energy is needed for control logic (including the program
memory), data memories, data transfers between processor units
(routing energy), additional spurious transitions (other than those
that are already included in the intrinsic energy of the operators),
and the energy consumed in the clock tree and in the registers.

The intrinsic energy Ei is the lower bound in energy consumption that
is needed by an ideal (dedicated) hardware data path without any ad- 
ditional overhead energy due to glitches or clock networks. Even opti- 
mized real hardware needs either energy for a clock network and regis- 
ters or - in the case of a pure combinational network - it needs additional 
energy for unavoidable spurious transitions (glitches) due to the signal 
timing slack of intermediate results. 

The percentage of overhead energy to overall energy can be significant:
[223] reports an overhead energy of at least 64% and [231] and [249]
estimate overhead energies of at least 79% and 70% respectively. In
[124] a range between at least 61% for embedded processors and at
least 72% for high end processors is reported. The fact that the overhead
energy for a processor is of this order agrees with the results of the
case study in Section 7.1 of this thesis.
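
To relate these percentages to the definitions above: with Etot = Ei + Eovhd, an overhead share s = Eovhd/Etot implies Etot = Ei / (1 - s). The short sketch below evaluates this ratio for the reported lower bounds; the function name is hypothetical and the calculation is a direct consequence of the definitions, not an additional measurement.

```python
def total_to_intrinsic_ratio(overhead_share: float) -> float:
    """Etot / Ei for a given overhead share Eovhd / Etot (0 <= share < 1)."""
    if not 0.0 <= overhead_share < 1.0:
        raise ValueError("overhead share must be in [0, 1)")
    return 1.0 / (1.0 - overhead_share)


# Lower bounds on the overhead share quoted in the text.
for share in (0.61, 0.64, 0.70, 0.72, 0.79):
    ratio = total_to_intrinsic_ratio(share)
    print(f"overhead share {share:.0%}: Etot is about {ratio:.1f} x Ei")
```

An overhead share of 64% thus means that the processor spends roughly 2.8 times the intrinsic computational energy, and a share of 79% corresponds to a factor of roughly 4.8.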

ASIP optimization aimed at lowering the energy consumption has to
decrease the overhead energy. One way to achieve this is to decrease
the runtime of the given task by application-specific data path
optimizations and/or by an optimized software implementation. Figure 4.6
illustrates this effect, which relies on the assumption that the overhead
power is (nearly) unaffected by the optimization. Obviously, not all of
the architectural changes that have been described in the previous
subsection are able to meet this assumption:

• Parallel functional units or ASIP accelerators that represent a 
close match to the application’s control data flow graph are a typi- 
cal example for efficient low-power data path optimization. The 
principle of this technique is to increase the rate of operations 
without increasing the rate of instructions. This implicitly requires 
dedicated application-specific instructions to support the increased 
parallelism in the data path. This optimization leads to architec- 
tures that are beyond the typical SIMD class of processors, because 
the data path is not restricted to perform the same computations on 
a set of data. A single highly optimized ASIP instruction can rather 
trigger a number of arbitrary data processing operations. 

• If chaining of operations is possible without violating the time
constraints and without introducing additional registers, this
modification also tends to increase the energy-efficiency. It also
typically requires adding one or more instructions to the ASIP
instruction set, which leaves the overhead energy nearly unchanged (at
least in a simple single issue processor). Unfortunately, if new
interconnections between e.g. the general purpose register and
the chained operators have to be introduced, the size of the
interconnection networks increases, which in turn increases the overall
data routing energy for any data transfer on the modified
interconnections. The overall effect of this modification has to be
thoroughly evaluated in each case in order to find out whether the
modification was successful.

Figure 4.6: Principle of Energy Reduction with Optimized ASIP Architecture

• Additional pipeline registers to increase the pipeline depth have
several effects: Firstly, pipeline registers reduce the spurious
activity by resetting the signal slack (the difference between the
earliest and the latest signal arrival event) to nearly zero. Secondly,
additional pipeline registers need a larger clock tree and the registers
themselves require additional energy due to clock activity. Finally,
additional pipeline stages possibly result in larger branch penalties
and more complicated logic to detect and resolve hazards. The case
study in Section 7.1 evaluates several pipeline depths in order to
find out which effect dominates.

• Similar effects occur in the case of multi-cycle operations,
depending on whether additional pipeline registers or additional
control logic have been used. If the multi-cycle operation is
associated with one processor instruction that replaces a sufficient
number of simple instructions, this optimization also reduces the
energy of the instruction memory and of the decoder.

• Retiming can be used to decrease the critical path of an
implementation, which enables the synthesis tool to take advantage of
the increased degrees of freedom for low-power logic reorganization
[45]. It also enables the use of slower, more power efficient
operator implementations if available. According to [161] and [183],
retiming can also be used to decrease the switching power of
sequential circuits. Precise knowledge of the switching activities and
the capacitances of the circuit is needed for this optimization.

• Sufficiently high data and instruction memory bandwidth as well 
as sufficiently high I/O rates have to ensure an optimum proces- 
sor resource utilization, which is also a means of decreasing the 
overhead energy. In case of unavoidable no-operation cycles of 
functional units due to memory wait states, the processor’s over- 
head energy should be reduced as far as possible (e.g. by clock 
gating and sleep modes). 
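
The principle behind the optimizations listed above can be summarized in a simple first-order model: the task energy is the intrinsic energy plus the overhead power multiplied by the runtime. The sketch below assumes that the intrinsic energy stays constant because the same useful operations are computed; all numbers and names are hypothetical.

```python
def task_energy(e_i: float, p_ovhd: float, n_cyc: int, t_clk: float) -> float:
    """Etot = Ei + Povhd * (Ncyc * Tclk), all quantities in SI units."""
    return e_i + p_ovhd * n_cyc * t_clk


T_CLK = 10e-9   # assumed clock period: 10 ns
E_I = 2e-6      # assumed intrinsic computational energy: 2 uJ

# Baseline: 10000 cycles at 50 mW overhead power.
baseline = task_energy(E_I, 50e-3, 10_000, T_CLK)

# Optimized data path: 4000 cycles, overhead power slightly higher (55 mW).
optimized = task_energy(E_I, 55e-3, 4_000, T_CLK)

saving = 1.0 - optimized / baseline
print(f"baseline {baseline * 1e6:.1f} uJ, optimized {optimized * 1e6:.1f} uJ, "
      f"saving {saving:.0%}")
```

In this example the runtime reduction outweighs the moderate increase of the overhead power. A modification that raises the overhead power by the same factor by which it shortens the runtime would not save any energy in this model.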

Apart from reducing the overhead energy by reducing the runtime of a 
task, the overhead power Povhd can be reduced directly e.g. by optimiz- 
ing one of the following: 

• high coding density reducing the size and the energy consumption 
of the instruction memory (this corresponds to a high average ra- 
tio of executed operations per instruction bit - this metric is also 
implicitly optimized by SIMD extensions, operator chaining and 
ASIP accelerators) 

• optimized instruction encoding (refer to Subsection 7.1.3) reduc- 
ing the energy consumption within the instruction memory or al- 
ternatively, the switching energy of an external instruction bus 

• reduced data memory accesses by software optimization together
with a sufficiently high number of local registers

• optimized data encoding for highly capacitive nodes

• application-specific (limited) interconnections between functional 
units or within an ASIP accelerator to decrease the data routing 
energy by decreasing the effective switched capacitances 

• guard logic to reduce/avoid spurious transitions in combinational 
logic (refer to Subsection 3.3.2 for further explanations) 

• clock gating in order to shut down idle parts of the clock tree and 
to reduce the energy needed in the connected flip-flops 

All of these direct optimizations lead to a more energy-efficient imple- 
mentation typically without a negative impact on computational perfor- 
mance or silicon area. 

If ASIC technology vendors supported characterized cell libraries for a
larger voltage range than today, this would also enable the designer
to reduce power by taking advantage of a reduction in clock frequency
for less computationally demanding tasks together with aggressive
voltage scaling.
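
Why voltage scaling is attractive can be illustrated with the common first-order model of CMOS switching power, P = a * C * Vdd^2 * f. This model is an assumption of the sketch below and is not taken from the discussion above; it ignores leakage and short-circuit currents, and the supply voltages, capacitance and frequencies are hypothetical.

```python
def dynamic_power(c_eff: float, v_dd: float, f_clk: float, activity: float = 1.0) -> float:
    """First-order switching power: activity * C * Vdd^2 * f."""
    return activity * c_eff * v_dd ** 2 * f_clk


def task_energy_dynamic(c_eff: float, v_dd: float, n_cyc: int, activity: float = 1.0) -> float:
    """Switching energy of a task with n_cyc cycles.

    The runtime is n_cyc / f_clk, so the clock frequency cancels out.
    """
    return activity * c_eff * v_dd ** 2 * n_cyc


C_EFF = 1e-9  # assumed effective switched capacitance per cycle: 1 nF

# Halving the clock frequency alone halves the power, not the task energy.
p_fast = dynamic_power(C_EFF, 1.8, 100e6)
p_slow = dynamic_power(C_EFF, 1.8, 50e6)

# Lowering the supply voltage as well reduces the energy per task.
e_nominal = task_energy_dynamic(C_EFF, 1.8, 10_000)
e_scaled = task_energy_dynamic(C_EFF, 1.2, 10_000)
print(p_slow / p_fast, e_scaled / e_nominal)
```

Under this model, reducing the supply voltage from 1.8 V to 1.2 V lowers the switching energy per task to about 44% of its nominal value, provided that the slower clock still meets the timing constraints of the task.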



4.4.3 Implementation Area 

Due to the fact that technology scaling continues to follow Moore’s 
law [184] so far without a perceptible deceleration, area consumption is 
gradually getting less important. Nevertheless, area is currently still a 
costly resource, which affects the unit price of an ASIC and has to be 
minimized in order to increase the profit margin for high volumes. 

Area consumption can also be an issue due to the following reason- 
ing: an increase in area increases the average length of interconnections 
(which can be estimated from wire load models [93] [256] used for syn- 
thesis), which in turn decreases the maximum clock frequency. Thus, 
excessive area increase has to be avoided in order to avoid performance 
degradation. To make matters worse, an increase of the average inter- 
connection length also increases the interconnection capacitance result- 
ing in a higher switching power. Long interconnection delays can be 
avoided by using local communication rather than global communication.
This means that the interconnections between hardware blocks
should be reduced to a minimum in order to save area and energy and
to enable high computational performance. Properly designed ASIPs
follow that guideline and use local communication between adjacent
functional units.

It should be emphasized that ASIPs inherently save area, because
they represent resource shared architectures^ compared to typical
dedicated hardware, which often exposes more parallel processing
structures. From a hardware perspective, ASIP design can be viewed as a
design methodology for implementing resource sharing by extracting
and implementing common operation patterns from a given control
data flow graph representation of an application.



4.5 Concluding Remarks 

This chapter has defined important terms for ASIP design as well as 
typical ASIP applications. Furthermore, the design space for ASIPs 
has been characterized with the goal to provide well-defined degrees of 
freedom for the ASIP designer. This characterization enables ASIP
designers and compiler designers to decide jointly whether specific
architectural features of an ASIP are needed and whether compiler
support for these features is possible. This design process is an
important aspect for the design efficiency of ASIPs and is commonly
referred to as compiler/processor co-design [284]. Finally, this chapter
has qualitatively discussed the effects of high level ASIP design
decisions on the computational performance, the energy consumption and
the implementation area. This knowledge is needed in order to
successfully apply the design methodology described in the following
chapter to real world applications.



^ Area can however be an issue for massively data or instruction parallel architectures or for ASIPs with 
parallelized accelerator extensions. 

The ASIP Design Flow



The design of ASIPs represents a hardware/software codesign task,
because hardware and software related expertise is needed in order to get
optimum results. This codesign problem can be considered as a complex
optimization problem in the multi-dimensional ASIP design space,
which has been defined in Chapter 4 and, additionally, also as a
software optimization problem. It is the primary goal of most ASIP designs
to find a programmable architecture that meets the performance
constraints of an application and that consumes a minimum of area and
energy. Moreover, this architecture should be sufficiently
flexible to cope with late design changes due to evolving standards or
incorrect specifications.

This chapter provides a complete description of the proposed ASIP
design flow^1, which starts with a behavioral description of the algorithm
and ends with the optimized ASIP hard- and software implementation
including design tools and documentation. One important feature of the
proposed ASIP design flow is the fact that the ASIP hard- and software
is in the iteration loop. This enables the designer to jointly optimize
performance, silicon area and energy in order to get a feasible
implementation in a short amount of design time.

An overview of the entire ASIP design flow is depicted in Figure 5.1
starting with the behavioral high level language (HLL) specification of
the application. The ASIP design tasks in Figure 5.1 represent a
top-down design approach, which enables the designer to cope with
complexity by using several abstractions. During the initial application
profiling, the abstraction of high level language statements and operators
is used in order to collect execution statistics of the application. For
the initial instruction set architecture definition, the designer can use a
cycle-true description of the instruction behavior, which neglects details
of the low-level hardware implementation in the first place. Afterwards,
the implementation of the hardware RTL description uses the same high
level operators, neglecting the specific logic implementation, which is
added later on either by explicit definition or by logic synthesis^2. These
abstractions imply that precise low level estimations can lead to
iterations in the design flow resulting in changes of higher level design
decisions.

^1 The examples in this chapter focus on ASIP performance enhancements
rather than energy optimizations, which simplifies the discussion.
Chapter 7 extends this focus by covering the energy consumption
and the design time of ASIPs using two real-world examples.

The design flow in Figure 5.1 starts with an arbitrary application, which 
might contain parts suitable for an ASIP software implementation and 
other parts suitable for a more dedicated hardware implementation. In 
order to identify this partitioning and perform the mapping to hard- and 
software, several design tasks are needed that are beyond the definition 
of an instruction set architecture. Nevertheless, these design tasks are 
important in order to realize an optimum implementation in any case. 

The described ASIP design flow is adaptable to the needs of typical 
applications, which is demonstrated by examples. The examples in this 
chapter should not be regarded as complete case studies, but rather serve 
as vehicles in order to illustrate important design decisions. The se- 
lected example applications have been chosen as a representative subset 
of possible ASIP applications in order to cover relevant DSP applica- 
tions. 

This chapter is organized as follows: In Section 5.1 the example appli- 
cations and example kernels are briefly introduced. Section 5.2 depicts 
the application profiling and partitioning tasks, which are needed prior 
to the actual ASIP design tasks described in Section 5.3. The examples 
in Section 5.2 and Section 5.3 are set apart using a serif font.



5.1 Example Applications 

In this section the ASIP applications are presented that will be used as 
examples in the next section. These examples have been selected in or- 
der to cover a reasonably large part of the ASIP design space concerning 



^2 Alternatively, for high speed arithmetic either custom designs or tools
like Synopsys’ Module Compiler [236] can be used.

Figure 5.1: Overview of ASIP Design Tasks



• complexity of the task (in code lines of a high level language 
(HLL) description) 

• control flow vs. data flow orientation 

• cyclostationary vs. non-cyclostationary data processing 

• high vs. low data locality 

• high vs. low data rate and computational requirements 

• different operator granularity 

Figure 5.2 classifies the selected applications by complexity using the
domains of system level applications, complex subtasks, and compu- 
tational kernels. The focus of this thesis is the implementation and 
design methodology for complex subtasks and computational kernels 
rather than the design of complete systems. 



Figure 5.2: Example ASIP Applications (complexity axis from high to
low: System Level Applications, Complex Subtask, Computational Kernel)



The two subtasks which are considered in the following are acquisition
and tracking for a terrestrial digital video broadcasting receiver
(DVB-T A&T application, for details refer to Section 7.1) and
eigenvalue decomposition (EVD application^3) of a complex Hermitian
matrix, which is needed e.g. for direction of arrival estimation [203]
and for subspace-based channel estimation [26].

The DVB-T A&T task initially uses several parameter acquisition
phases that expose non-cyclostationary processing and finally enters a
theoretically continuous parameter tracking which is largely
cyclostationary. Due to the huge number of control parameters of the DVB-T
A&T tasks, this application represents a mixed control/data flow
oriented application. On the other hand, the EVD uses cyclostationary
processing, because it represents a data flow oriented architecture that is
based on the granularity of matrix block processing. Parts of the DVB-T
A&T task require very high computational performance, which is
directly determined by the DVB-T transmission time frames, whereas
other parts only affect the acquisition time of the system. On the other
hand, the EVD for a direction-of-arrival (DOA) estimation requires high
computational performance, because this impacts the number of
supported mobile users for a mobile base station.

^3 The EVD in the example applications uses a matrix size of 10x10.

The different computational kernels that have been selected for
illustration purposes are finite impulse response filters (FIR), fast
Fourier transformation (FFT), and coordinate rotation digital computer
computations (CORDIC). Descriptions and listings of the high level language
implementations^4 which have been used as behavioral descriptions for
these kernels can be found in Appendix B.

The properties of the selected applications are described in Table 5.1. 
Examples for constraints concerning the data rate are given later on in 
this chapter. It has to be mentioned that several properties in Table 5.1 
depend on the software implementation. For instance, the data locality 
of the FIR depends on the implementation of the delay line for the input 
samples: If this delay line is realized with explicit memory move oper- 
ations, the data locality is medium, whereas if a circular buffer is used, 
the data locality is high. Another example is the SW implementation
dependent granularity of operators, which can be refined for any
application to the granularity of standard word-level or bit-level operators.
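
The dependence of the FIR data locality on the delay line implementation can be illustrated with the following sketch. It is not the reference implementation of Appendix B; the function names are hypothetical, and the move counter only serves to make the additional memory traffic visible.

```python
def fir_shift(samples, coeffs):
    """FIR with a delay line that is shifted by explicit memory moves."""
    n = len(coeffs)
    delay = [0] * n
    moves = 0
    out = []
    for x in samples:
        for k in range(n - 1, 0, -1):   # move every stored sample by one place
            delay[k] = delay[k - 1]
            moves += 1
        delay[0] = x
        out.append(sum(c * d for c, d in zip(coeffs, delay)))
    return out, moves


def fir_circular(samples, coeffs):
    """FIR with a circular buffer: only the write index moves, no data copies."""
    n = len(coeffs)
    delay = [0] * n
    head = 0
    out = []
    for x in samples:
        delay[head] = x                 # overwrite the oldest sample
        acc = 0
        for k in range(n):
            acc += coeffs[k] * delay[(head - k) % n]
        out.append(acc)
        head = (head + 1) % n
    return out, 0


impulse = [1, 0, 0, 0]
print(fir_shift(impulse, [1, 2, 3]))
print(fir_circular(impulse, [1, 2, 3]))
```

Both variants compute the same output, but the shifted delay line needs one additional memory move per tap and input sample (minus one), which amounts to 63 moves per sample for the 64-tap FIR used in this chapter.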





Application  Data            Granularity          Complexity           Control/Data        Cyclostationary
             Locality        of Operators         (code lines,         Flow Domination     Processing
                                                  states)
---------------------------------------------------------------------------------------------------------
FFT          low                                  medium               data                yes
FIR          medium to high  real scalar          medium               data                yes
CORDIC       high            real scalar          medium               mixed               yes
EVD          medium          complex scal./vect.  medium/high          mixed contr./data   largely
DVB-T A&T    high            real scalar          high                 mixed contr./data   during tracking

Table 5.1: Properties of the Different Selected Example Applications



^4 The key parameters for these kernels are: 64 taps for the FIR, 8192
points for the FFT, and 24 iterations for the CORDIC task.

5.2 Application Profiling and Partitioning

The proposed design flow of Figure 5.1 starts with a HLL description
of the algorithm, which has to be profiled in order to identify critical
parts. Criticality on this level of abstraction refers to parts of the
algorithm that require high computational performance and/or high memory
bandwidth. With these results, the partitioning into parts of the
application that will later on be mapped to ASIPs, ASIP coprocessors or
dedicated hardware can be performed.



5.2.1 Stimulus Generation for Application Profiling 

This subsection describes the requirements and issues of stimulus
generation for application profiling. Typically, either stimulus generation
from scratch or stimulus reuse and extraction of already available
stimuli by the system simulations is required to expose the performance
critical parts of an application. This profiling stimulus generation design
task does not have to provide full simulation coverage of all arithmetic
and logical functionalities of the reference and of all internal states (like
the stimulus generation for verification); it rather has to produce the
worst case runtime^5 of the application to obtain realistic profiling
results for real-time applications^6. A comparable stimulus generation task
has to be performed after an initial ASIP implementation is available in
order to generate worst case stimuli for architecture profiling.

Example: Due to the complexity and configurability of the DVB-T A&T
application, profiling stimuli could not be generated by the system simulation.
Table 5.2 depicts the complexity of all the different DVB-T A&T tasks in
number of C code lines and number of I/O parameters as well as the complexity of
the SW testbenches. Each of the tasks Pre-FFT-, Post-FFT-Acquisition, and
Post-FFT-Tracking has about 100 internal state variables: The control flow in
the application is a function of a smaller subset of these state variables. This
subset of state variables has been identified and suitable testbenches have
been manually written in order to execute the critical path in the software for
worst case runtime. For this complex application with many boundary cases^7,
a divide and conquer approach has been used, which partitions the application
and the testbench into smaller pieces that can be independently profiled^8.

^5 Worst case runtime can be excited by requiring the maximum number of
operations/data transfers per time unit of a later implementation.

^6 For non-real-time applications, the typical runtime can also be used as
a metric for this design task.

^7 These applications typically use deeply nested HLL control statements
like e.g. if or switch.

The initial effort to create these software testbenches adds up to more than 
two man weeks for the DVB-T A&T application, which corresponds to more 
than 5% of the total design time. These testbenches in modified form have 
been reused later on in the design flow for architecture profiling and for the 
verification between the software running on the instruction set simulator and 
the reference software. 



Task               Behavioral        # of I/O      Testbench            SW Testbench
                   Description       Parameters    (# of C              Design Time
                   (# of C                         Code Lines)          (# of Man Days)
                   Code Lines)
---------------------------------------------------------------------------------------
Reset              44                -             5                    <0.1
Pre-FFT-Acqu.      359               10/3          275                  5
Post-FFT-Acqu.     330               10/2          283                  4
Post-FFT-Tracking  397               4/3           278                  7
PEQ Estimation     60                3/1           (incl. in Tracking)  -

Table 5.2: Design Effort for SW Testbenches (DVB-T A&T)



5.2.2 Application Profiling 

The purpose of application profiling is to locate the performance
critical parts within an application using profiling stimuli for worst case
scenarios. Application profiling ideally should be
target-architecture-independent: Profiling at the abstraction level of
operators and memory accesses with a certain user-defined data granularity
should be preferred over measuring the instruction count of a real
implementation. Obviously, for HLL-code-based applications, it makes sense
to use typical HLL code operators for profiling and measure the memory
accesses to primitive data structures (like e.g. one integer element in an
integer array).

^8 For this purpose, the application has been temporarily modified to enhance the controllability.

In order to perform this profiling task (or an approximation of this task),
two basic approaches are possible: operator-based profiling or HLL
line-based profiling. Table 5.3 shows a comparison of these approaches.



Approach              HLL Operator-Based            HLL Line-Based
-------------------------------------------------------------------
Tool/Methodology      instrumented HLL code [56]    e.g. gcov [229]
Design Effort         high                          low
Execution Speed       lower                         high
Precision of Results  high                          low

Table 5.3: Comparison of Application Profiling Approaches

The line-based approach can be readily performed with a coverage tool 
like e.g. gcov [229] with virtually no additional design effort. Strictly
speaking, a coverage tool like gcov is intended to evaluate line coverage
information, but the output of this tool can obviously also be used for
profiling purposes. This approach enables the designer to obtain an overview of
critical loops in the application within a very short amount of time. Un- 
fortunately, the accuracy of the results is poor, because the number of 
high level language operators and memory accesses on each HLL line 
varies. 

On the other hand, true operator-based profiling using an instrumented
HLL code requires adding profiling statements to the code,
which leads to a significant increase in design time. More advanced
techniques for this design task use overloaded operator instances, which
update the profiling information as a side effect. For this thesis,
however, a different application profiling approach is proposed. This
proposed profiling methodology uses the concept of a so-called profiling
processor as a virtual target architecture together with an optimizing 
HLL compiler. The HLL application is mapped to this profiling proces- 
sor and the instruction/operator count as well as the memory accesses 
are evaluated by an instruction set simulator. It may be argued that this 
concept violates the target architecture independence of the application 
profiling task, because a real instruction set architecture is used as a tar- 
get for profiling. This objection is partially true, but the advantages of 
the proposed methodology outweigh the demerits: 

• the profiling processor can be taken out of a processor template
library (PTL^9) (together with the HLL compiler and the simulation
tools), which enables a complete automation of the application
profiling task

• additional information (apart from operator count and memory ac- 
cesses) can be obtained by this approach, like data locality, branch 
frequency etc. 

• the results are accurate in the sense that the measured instruction
count on this profiling processor could be implemented with a real
instance of this processor

• exactly the same methodology can be used later on for architecture 
profiling by taking the processor template (out of the PTL) for the 
processor class (cf. Subsection 5.2.4) that optimally matches the 
application and by using this processor template as a starting point 
for design space exploration 

The simplest profiling processor implements a so-called basic instruction 
set with two-operand instructions (register/register or immediate/register), 
which has been described in [91] and which is reprinted 
in Table 5.4. Two separate flat address spaces for I/O and data memory 
can be accessed by register indirect addressing with displacement. 

This instruction set is only a subset of typical HLL operators, because 
operators like division and modulo operations are usually very expen- 
sive in terms of either silicon area, energy or latency and, consequently, 
have been omitted in this instruction set. Obviously, these or other 
operators can be added if an application needs them extensively. 
Furthermore, explicit I/O instructions have been provided in order to profile the 
input/output behavior of the algorithm. 

This profiling processor uses a one-stage pipeline organization, which 
avoids forwarding paths and corresponds to an instruction-true abstraction 
from a real pipelined implementation. The architecture does not 
support instruction level parallelism, but rather provides a scalable general 
purpose register file, optionally with a cache memory hierarchy, in 

^Implementations of the software tools and the hardware for different processor classes (processor 
classes can include the examples of Subsection 5.2.4 or the classes defined in Section 4.3) can also be 
stored in this PTL to provide a starting point for the design space exploration and implementation. 

^^These I/O instructions can be supported by compiler-known-functions. 
Type of        Instruction
Instruction    Mnemonic      Description
-------------  ------------  ----------------------------------------
Load/Store     RDIO          read I/O data
               WRIO          write I/O data
               RDM           read data memory
               WRM           write data memory
arithmetic     ABS           absolute value
               ADD/ADDI      addition
               MOV/MOVI      data move
               MULU/MULS     signed/unsigned multiplication
               SHL/SHLI      arith./logic. shift left
               SRA/SRAI      arith. shift right
               SUB/SUBI      subtraction
logic          AND/ANDI      bitwise AND
               OR/ORI        bitwise OR
               SRL/SRLI      logic. shift right
               XOR/XORI      bitwise XOR
control        CMP/CMPI      compare/set status
               BRA           uncond. branch
               BSR           branch to subroutine
               BEQ/BNE       branch if equal/not equal
               BLT/BLE       branch if less than/less or equal
               BGT/BGE       branch if greater than/greater or equal
               END           exception/transition to idle mode
               RTS           return from subroutine

Table 5.4: Basic Instruction Set for Profiling 



order to measure the data locality of data flow intensive algorithms. The 
HLL compiler for this profiling processor has been implemented with 
the COSY Compiler Development System [3]. 

Example: Figure 5.3 shows the result of HLL line-based profiling for the DVB-T 
A&T application, where loop kernels with a significant number of iterations 
can be clearly identified. This profiling run uses realistic stimuli, which reflect 
the different states of the processor, namely a relatively short period for the 
acquisition of different parameters and afterwards the (theoretically continuous) 
tracking operation. Because of the large range in execution frequency, 
a logarithmic scale has been chosen for the vertical axis in Figure 5.3. 
Figure 5.3: HLL Line Coverage Profile (DVB-T A&T example) 



The profiling methodology described above, which uses the profiling processor, has 
also been applied to the DVB-T A&T application. Figure 5.4 shows the profiling 
results for the assembler implementation with the same stimuli that have 
been used for Figure 5.3. The similarity of the two profiling graphs 
is obvious and is due to a close match between the compiler-generated 
assembly implementation and the reference HLL implementation. However, the 
scale of the horizontal axis clearly shows that several assembly instructions 
per line of HLL code have been executed. Provided that it is possible to implement 
a profiling processor that achieves one instruction execution per cycle 
for a given clock period constraint, the vertical axis in Figure 5.4 corresponds 
to clock cycles in the system. In this case, the (non-logarithmic) area under 
the graph for a certain address range in the program memory is proportional 
to the runtime that is spent in this part of the program. 
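
The relation between the area under the profile graph and the runtime can be illustrated with a minimal sketch. It assumes one executed instruction per cycle, as stated above; the function name `runtime_share` and the execution counts are hypothetical and serve only as an illustration.

```python
def runtime_share(exec_counts, start, end):
    """Fraction of all executed instructions located at addresses [start, end).

    exec_counts[a] is the number of times the instruction at program
    memory address a has been executed (one cycle each, by assumption).
    """
    total = sum(exec_counts)
    if total == 0:
        return 0.0
    return sum(exec_counts[start:end]) / total


# Hypothetical profile: addresses 2..5 form a loop kernel.
profile = [1, 1, 500, 500, 500, 500, 2, 2]
loop_share = runtime_share(profile, 2, 6)
print(f"loop kernel share of runtime: {100 * loop_share:.1f} %")
```

Summing the linear counts rather than reading the logarithmic plot is what makes the share meaningful.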

In order to assess the performance criticality of the application, quantitative 
data are needed, namely the ratio between worst case runtime of the profil- 
ing implementation and the maximum runtime constraint of the application. 
Obviously, this ratio should be smaller than 1.0 in order to obtain a feasible 
implementation. Table 5.5 depicts the performance evaluation results for the 
example applications under the assumption that the profiling processor has 
Figure 5.4: Assembler Line Coverage Profile (DVB-T A&T profiling impl.) 



32 general purpose registers and is running at the system clock frequency of 
the DVB-T system. The cycle constraints of all the computational tasks are 
violated in Table 5.5, which motivates optimizations of the processor architec- 
ture implementation. 



Application            Worst Case Cycle Count    Cycle Count    N_wc/N_max
                       N_wc (on Profiling        Constraint
                       Processor)                N_max
---------------------  ------------------------  -------------  ----------
Pre-FFT-Acquisition    7077                      4096           1.73
Post-FFT-Acquisition   5661                      4096           1.38
Post-FFT-Tracking      6208                      1024           6.06
PEQ Phase Estimation   1353                      192            7.05

Table 5.5: Cycle Count of Profiling Implementation vs. Max. Cycle Constraints 
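
The ratios listed in Table 5.5 follow directly from the two cycle counts of each task. As a minimal sketch (the dictionary layout and the function name `criticality` are illustrative assumptions, not part of the design flow), they can be recomputed as follows:

```python
# Worst case cycle count on the profiling processor and cycle count
# constraint for each task, copied from Table 5.5.
TABLE_5_5 = {
    "Pre-FFT-Acquisition": (7077, 4096),
    "Post-FFT-Acquisition": (5661, 4096),
    "Post-FFT-Tracking": (6208, 1024),
    "PEQ Phase Estimation": (1353, 192),
}


def criticality(n_wc, n_max):
    """Ratio of worst case cycle count to cycle constraint (feasible if < 1.0)."""
    return n_wc / n_max


ratios = {task: criticality(*counts) for task, counts in TABLE_5_5.items()}
violated = [task for task, ratio in ratios.items() if ratio >= 1.0]

for task, ratio in ratios.items():
    print(f"{task:22s} {ratio:5.2f}")
print(f"{len(violated)} of {len(ratios)} constraints violated")
```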



For the sake of conciseness, we limit the following discussion to the most critical 
DVB-T A&T tasks. After further examination of the Post-FFT-Tracking and 
PEQ Phase Estimation tasks, it was found that these tasks make intensive use 
of a common CORDIC subroutine, which has been marked in Figure 5.4. Figure 5.5 
illustrates the significant share of the total runtime that is consumed in the CORDIC 
subtask in each case. Consequently, optimization of the CORDIC subtask 
(possibly among other tasks) is needed in 
order to meet the cycle constraints for these critical computational tasks. 



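[Figure: two charts, "Complete PEQ Phase Estimation Task" and "Complete Post-FFT-Tracking Task", each divided into the CORDIC subprogram and the calling code (PEQ caller, Post-FFT-Trk. caller); labelled CORDIC shares: 96 % (PEQ Phase Estimation) and 40 % (Post-FFT-Tracking)]

Figure 5.5: Percentage of Runtime used for the CORDIC Subtask 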



5.2.3 HW/SW Partitioning 

HW/SW partitioning is a prerequisite to the actual ASIP design flow, 
which has to make sure that an instruction set oriented ASIP is a reasonable 
implementation for a given application or whether parts of the 
application are better mapped to coprocessors or dedicated hardware 
blocks. This decision is important for the energy-efficiency of the system, 
because the flexibility of ASIPs implies a higher energy consumption 
as demonstrated in Section 7.1. The logical consequence of this 
fact is that the high flexibility of ASIPs should only be used for tasks 
that require and take advantage of it. On the other hand, tasks with high 
computational requirements that do not need much flexibility should 
rather be mapped to a dedicated structure in order to take advantage of 
the higher energy-efficiency. 

Partitioning of the application into blocks of a reasonable size is followed 
by the selection of the hardware class which maps each of these 
blocks to either a software implementation (e.g. ASIP, microcontroller, 
general purpose processor), to a tightly coupled ASIP coprocessor or 
to dedicated hardware. This selection task is primarily controlled by 
the estimated performance of the target hardware. For many tasks with 
sufficiently low time constraints, however, both hardware and software 
implementations are possible. In such a case, additional parameters like 
energy-efficiency and flexibility have to be considered for this selection, 
which is illustrated in Figure 5.6. 

The difficulty of this task is that the estimates on this level 
of abstraction tend to be significantly imprecise, because the details of 
the target implementation are still unknown. For instance, in the case of 
an ASIP implementation, the designer might have a coarse idea of the 
ASIP instruction count and the coarse ASIP structure, but is still 
unaware of the critical path, the area and the power consumption of the 
implementation. This issue calls for a design methodology that provides 
a short path to implementation. 



[Figure: two design space sketches, each with a flexibility axis, labelled "ASIP Tetrahedron" and "ASIP + ASIP Coprocessor Hexahedron"]

Figure 5.6: Design Space for ASIPs and ASIP Coprocessors 



Example: For the above-mentioned CORDIC subroutine of the DVB-T A&T 
application, the selection of a hardware or a software implementation is not 
straightforward. On the one hand, the CORDIC requires high computational 
performance, which could be efficiently mapped to an energy-efficient ASIP 
coprocessor [86]. On the other hand, this coprocessor needs additional hard- 
ware resources, like shifters, adders and memories if implemented as a sep- 
arate entity. Most of these resources are needed in the ASIP anyway and can 
therefore be resource shared. Furthermore, in the case of late design changes 
after silicon fabrication, the dedicated coprocessor needs a full redesign of the 
chip, whereas a software implementation typically needs only a redesign of the 

"The remaining tetrahedron of the cube which is omitted in Figure 5.6 is the design space of dedicated 
hardware. 
program'^. For the sake of increased flexibility, the CORDIC task in the DVB-T 
A&T application has been mapped to an optimized software implementation. 
Finally, in the case of the EVD, which requires higher computational performance 
than the DVB-T A&T application, the CORDIC has been mapped to an ASIP 
accelerator. 



5.2.4 ASIP Class Selection 

The design space for ASIPs has already been described in Chapter 4. 
HLL support for complex applications is typically indispensable to 
avoid error-prone and tedious assembler coding. Provided that a HLL 
compiler or a compiler design environment (e.g. Cosy [3], Chess [156] 
or RECORD [165]) is required and available during the ASIP design, 
the compiler-supported subset of ASIP classes within this large design 
space has to be identified as a starting point for the following selection. 
Alternatively, in the absence of a compiler, the ASIP class has to be 
selected in order to facilitate manual assembler programming. However, 
this is only practical for less complex applications, because manual 
assembler programming of complex applications is tedious and error-prone. 

The task of finding the processor instruction set architecture (ISA) class 
that represents the best match to the application is of paramount impor- 
tance for the design efficiency. This selection also affects the verifica- 
tion effort of the final hard- and software, which represents a major part 
of the overall design time according to Appendix F. 

In the following discussion, several examples serve as illustration for 
the selection of important parameters for a suitable processor class. This 
discussion is not yet a complete commitment to a specific ISA. It rather 
determines a good starting point for the following ASIP optimization 
tasks. 

Non-Parallel vs. Instruction/Data Level Parallel ASIP: A certain 
task can be implemented using a scalar (single instruction issue) 
ASIP, provided that the cycle count of the software implemen- 
tation for the profiling processor in Subsection 5.2.2 is smaller 

*^In case of the DVB-T A&T application, this redesign of the program requires a redesign of the in- 
struction ROM masks for the chip fabrication at reduced costs compared to a full chip redesign. For other 
applications, which use on-chip RAMs as instruction memories, a redesign of the ASIP software does not 
affect the chip fabrication costs. 
than the cycle constraint with a certain cycle safety margin that 
accounts for the overhead of a real processor implementation. 
If this condition is not fulfilled, there are two options: either 
the ASIP has to be implemented as a classical data or instruc- 
tion level parallel architecture e.g. using a SIMD or a VLIW 
implementation. Alternatively, the designer has to optimize the 
scalar ASIP instruction set by implementing additional optimized 
instructions that are able to perform several frequently needed 
operations in parallel during one clock cycle, which results in a 
cycle count reduction. This kind of optimization is a typical ASIP 
optimization, which also enhances the energy-efficiency of the 
implementation (refer to Section 7.1 for details). 
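
The feasibility condition for a scalar ASIP can be written down as a short check. The following is only a sketch: the function names are hypothetical, and the safety margin of 20 % is an assumed placeholder, since the text does not quantify the margin.

```python
def required_speedup(n_wc, n_max, safety_margin=0.2):
    """Factor by which the profiled cycle count must shrink to meet the constraint.

    safety_margin is an assumed relative allowance for the overhead of a
    real processor implementation; 0.2 is a placeholder value.
    """
    return n_wc * (1.0 + safety_margin) / n_max


def scalar_asip_sufficient(n_wc, n_max, safety_margin=0.2):
    """True if the unmodified scalar profiling architecture meets the constraint."""
    return required_speedup(n_wc, n_max, safety_margin) < 1.0


# Post-FFT-Tracking (Table 5.5) needs either parallelism or specialized instructions.
print(scalar_asip_sufficient(6208, 1024), round(required_speedup(6208, 1024), 2))

# A hypothetical task with 800 profiled cycles would fit the same constraint.
print(scalar_asip_sufficient(800, 1024))
```

A required speedup well above 1.0 indicates how much cycle count reduction the specialized instructions or the parallel architecture have to deliver.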

Example: The choice between VLIW ASIP or optimized non-parallel 
ASIP significantly affects the design time and the area/energy-efficiency 
of the implementation. For the DVB-T A&T application, a scalar, single 
issue processor implementation has been chosen, in order to reduce the 
design time and to obtain the best possible energy- and area-efficiency. 
This choice requires optimization of the implementation using special- 
ized instructions according to the results of application profiling. For the 
EVD an architecture that represents a mixture between a pure scalar 
and a SIMD architecture has been selected in order to speed up the 
processing of vector and matrix operations. 

Organization/Access of Storage Elements: The memory and ASIP-internal 
register structure has to be organized in order to speed 
up the common and critical tasks of the application. This requires 
small and fast scratch-pad memories together with reasonably 
sized internal registers and register files. For applications 
with a high ratio R_l/s of load/store operations N_l/s to the total 
number of executed operations N_tot, the use of a pure load/store 
architecture with a central register file is disadvantageous. In 
such a case, a memory-memory architecture or a heterogeneous 
architecture (which uses memory-memory instructions together 
with load/store instructions) is better in order to reduce the number 
of explicit load/store instructions, which need energy in the 
fetch/decode stage and result in a large footprint in the instruction 
memory. Furthermore, the significant energy overhead of writing 
and reading the general purpose register can be avoided. 

Example: Table 5.6 shows the frequency of load/store instructions of the 
profiling implementation for the considered example applications. The 
DVB-T A&T application can be readily implemented using a reasonably 
small, flat data memory in combination with a load/store architecture due 
to the small ratio R_l/s. However, the FFT has a large ratio R_l/s, which 
suggests a memory-memory architecture or, alternatively, a heterogeneous 
architecture with optimized memory-memory instructions like in 
[138]. A more detailed application analysis for the EVD reveals that it 
is indeed possible to take advantage of a large general purpose register 
file together with additional address registers in order to exploit data locality, 
which also suggests a load/store architecture. For the current FIR 
implementation, a memory-memory architecture is one option. However, 
a small change in the FIR high level description'^ using a circular buffer 
for the delay line can avoid about 50% of the memory accesses, which 
makes a load/store architecture also a feasible alternative. 
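
The circular buffer idea can be sketched as follows. This is not the DSPstone code; both functions are hypothetical illustrations that compute the same filter output. The first one moves every stored sample once per input sample, the second one only moves an index.

```python
def fir_shift(samples, coeffs):
    """FIR filter whose delay line is shifted by one position per input sample."""
    delay = [0] * len(coeffs)
    out = []
    for x in samples:
        # Every stored sample is read and written once just to make room.
        for i in range(len(delay) - 1, 0, -1):
            delay[i] = delay[i - 1]
        delay[0] = x
        out.append(sum(c * d for c, d in zip(coeffs, delay)))
    return out


def fir_circular(samples, coeffs):
    """Same filter; stored samples stay in place and only the head index moves."""
    n = len(coeffs)
    delay = [0] * n
    head = 0  # position of the newest sample
    out = []
    for x in samples:
        head = (head - 1) % n
        delay[head] = x
        out.append(sum(coeffs[k] * delay[(head + k) % n] for k in range(n)))
    return out


samples = [1, 2, 3, 4, 5]
coeffs = [1, 0, -1]
print(fir_shift(samples, coeffs), fir_circular(samples, coeffs))
```

On a processor, the modulo addressing of the second variant is typically provided by dedicated address registers, so the saved data moves are not replaced by extra arithmetic instructions.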



Application    Percentage of Load/Store Operations
               (R_l/s · 100%)
-----------    -----------------------------------
DVB-T A&T      3.5%
CORDIC         4.8%
EVD            17.4%
FIR            28.7%
FFT            36.8%

Table 5.6: Percentage of Load/Store Instructions (Profiling Implementation) 
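
The ratio R_l/s reported in Table 5.6 can be derived from the instruction histogram that the instruction set simulator produces. The sketch below assumes that the histogram is available as a mapping from mnemonic to execution count and that all four load/store mnemonics of Table 5.4 (including the I/O accesses) are counted; both points are assumptions, and the numbers are hypothetical.

```python
# Mnemonics of the load/store group in Table 5.4.
LOAD_STORE = {"RDIO", "WRIO", "RDM", "WRM"}


def load_store_ratio(histogram):
    """R_l/s = N_l/s / N_tot for a mnemonic -> execution count mapping."""
    n_tot = sum(histogram.values())
    if n_tot == 0:
        return 0.0
    n_ls = sum(count for mnemonic, count in histogram.items()
               if mnemonic in LOAD_STORE)
    return n_ls / n_tot


# Hypothetical histogram of a data flow dominated kernel.
histogram = {"RDM": 30, "WRM": 10, "ADD": 40, "MULS": 15, "BNE": 5}
ratio = load_store_ratio(histogram)
print(f"R_l/s = {100 * ratio:.1f} %")
```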



Pipeline: Depending on the clock period constraint of the system, the 
depth of the pipeline has to be chosen in order to meet this constraint. 
For a high operating frequency a long pipeline is necessary. 
On the other hand, a longer pipeline tends to expose a larger 
branch penalty for taken branches. This is especially disadvantageous 
for tasks with a large ratio R_branch of taken branch operations 
N_branch to the total number of executed operations N_tot. 

Example: In Table 5.7 the percentage of taken branch operations during 
program execution of the example applications is depicted. For the 
CORDIC, the FFT and the EVD this percentage is negligible, because 
a more thorough investigation of these applications reveals that these 
branches are mostly used to implement loops which can be replaced 
by zero-overhead loop instructions. On the other hand, for the DVB-T 
*^The current description has been taken from the DSPstone kernel collection [285]. 

'“^For this analysis a general purpose register file with 32 general purpose registers has been used. 

*^This penalty model assumes a predict-untaken branch execution scheme [107] without executing 
instructions in the branch delay slots. 






















A&T application the significant number of taken branches reflects the 
property of a typical mixed control/data flow application. This high percentage 
motivates the use of a short pipeline in order to mitigate the 
overall branch penalty. 

Granularity of Functional Units: For a good match between architec- 
ture and application, the granularity of the data that are processed 
in the functional units of an ASIP should reflect the data types in 
the application. 

Example: The DVB-T A&T application mostly uses scalar data with a 
word length between 16 and 32 bit. Consequently, most of the functional 
units of the core use 32 bit operations. On the other hand, the FFT as 
well as the EVD use complex data of different bit widths requiring com- 
plex arithmetic operations (additions, subtractions, multiplications and 
shifts) in the functional units, which have to be designed for the maxi- 
mum needed bit widths in each case. 



Application    Percentage of Taken Branch      Note on branch characteristics
               Operations (R_branch · 100%)
-----------    ----------------------------    ------------------------------
DVB-T A&T      9.3%                            mostly data dependent
CORDIC         10.7%                           mostly needed for loops with
                                               data indep. iteration count'®
FIR            7.1%                            same as above'®
EVD            4.1%                            same as above'®
FFT            2.1%                            same as above'®

Table 5.7: Percentage of Taken Conditional Branch Instructions (Prof. Impl.) 
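
The influence of the pipeline depth on the branch penalty can be estimated from the values in Table 5.7. The sketch below is a simplification under a stated assumption: with the predict-untaken scheme, a taken branch is assumed to be resolved in the last pipeline stage, so that (depth - 1) already fetched instructions are discarded. The function name is hypothetical.

```python
# Share of taken branches among all executed operations (R_branch), Table 5.7.
R_BRANCH = {
    "DVB-T A&T": 0.093,
    "CORDIC": 0.107,
    "FIR": 0.071,
    "EVD": 0.041,
    "FFT": 0.021,
}


def branch_overhead(r_branch, pipeline_depth):
    """Additional cycles per executed operation caused by taken branches.

    Assumption: each taken branch costs pipeline_depth - 1 cycles.
    """
    penalty_cycles = pipeline_depth - 1
    return r_branch * penalty_cycles


for depth in (1, 2, 3, 5):
    overhead = branch_overhead(R_BRANCH["DVB-T A&T"], depth)
    print(f"pipeline depth {depth}: {100 * overhead:.1f} % additional cycles")
```

For the loop-dominated applications, the effective R_branch after introducing zero-overhead loop instructions would have to be inserted instead of the table value.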



The bottom line of this section is that according to the above-mentioned 
quantitative properties of an application (instruction level parallelism, 
locality, control vs. data flow dominance, granularity), the task of ASIP 
class selection can be formalized for many different applications. This 
facilitates and speeds up this important design task, increasing the design 
efficiency. 

An orthogonal issue to the aspects mentioned above is verifiability of 
the processor hard- and software. The verification tasks and the impact 
of architectural decisions on the verification effort are discussed in 
Section 5.4. 

^^The majority of these branches can be realized by zero-overhead loop control instructions, which 
reduces the overall branch penalty to less than 2% in all these cases. 



5.3 Combined ASIP HW/SW Synthesis and Profiling 

The selection of the processor class in the previous subsection does not 
define an accurate point in the multi-dimensional ASIP design space. 
It rather provides a starting point for further optimizations using con- 
straints imposed by the application and the programmability. Thus, 
there are still many open design issues that need to be explored and 
optimized, such as 

• organization and number of internal registers as well as memories 

• behavior/coding of instructions, addressing modes etc. 

• detailed pipeline organization (e.g. forwarding and bypassing) and 
mapping of operators to pipeline stages and functional units 

• control of core operation (e.g. for startup, reset and IRQ process- 
ing) 

• pipeline control policy (e.g. for branches and wait states) 

• interface implementation 

• optionally: coprocessor structure and implementation 

Figure 5.7 shows the proposed design flow, which covers all the above- 
mentioned issues in order to find a feasible instruction set architecture 
and implementation. The main difference between the proposed design 
flow and previously published ASIP design approaches is that the hard- 
ware implementation in the proposed flow is in the iteration loop. This 
has significant consequences: On the one hand, this requires the de- 
signer to maintain several consistent descriptions of the ASIP instruc- 
tion set architecture^^ (and the interfaces and coprocessors) namely for 
the instruction set simulator (or the system simulator, which includes 

^^This issue will be solved in Chapter 6 by a processor description language and advanced design tools 
which require only a single description of the ASIP. 
interfaces and coprocessors), the software design tools and for the hardware 
implementation in the form of a HDL. On the other hand, this methodology 
makes sure that the actual implementation is really able to meet 
the application constraints for cycle time, area and energy. In other 
words, this methodology enables the designer to optimize high level parameters 
like the cycle or instruction count of an implementation, while 
being able to track the effect of high level optimizations on the low 
level parameters cycle time, area and energy consumption. This results 
in fewer iterations and in a shorter design time, provided that this design 
flow can be automated to a large extent. 

In the following, the different subtasks of Figure 5.7 will be described 
with a focus on the ISA definition and on the iterative ISA optimiza- 
tion. This discussion also defines the requirements for the design tools 
of Chapter 6, which are needed to support and automate the proposed 
design flow of this thesis. 



5.3.1 ASIP Interface Definition 

Off-the-shelf processors in the form of packaged chips or hard-macros are 
unable to adapt their interfaces to the external world. Synthesizable 
ASIPs, however, can easily be integrated into a system-on-a-chip that 
requires a proprietary interface behavior. From a system perspective, 
a black box that is implemented by an ASIP can be regarded as an ordinary 
hardware block with typical hardware interface characteristics. 
Thus, the required ASIP interface mechanisms have to be negotiated 
between different designers or design teams. A conventional processor 
is not able to handle fast streams of input data efficiently, because 
of the task switching overhead and the instruction overhead needed to read the 
data from the input port and to store them. Therefore, either specialized 
instructions or dedicated I/O coprocessors have to be used. For performance 
critical tasks with little runtime headroom for I/O operations, a DMA 
controller with double buffering is an option, which enables the ASIP 
core to focus on computations rather than on I/O activity. The detailed 
interface implementation can be subject to iterative refinement during 
the ASIP design flow in Figure 5.7. 

[Figure: flow chart of the design loop, entered "from higher level design tasks (cf. Figure 5.1)" and left with "continue with verification and documentation"]

Figure 5.7: Combined ASIP HW/SW Synthesis and Profiling 

Example: For the DVB-T A&T application a dedicated I/O processor together 
with instructions to support synchronization channels is needed in order to 
meet the interface constraints of the system environment. This implementation 
enables simultaneous I/O operations together with normal data processing in 
the ASIP core, effectively decreasing the cycle count for critical tasks. For the 
EVD, double-buffering is advantageous, because of the larger amount of data 
that have to be transferred for each iteration. 
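
The benefit of double buffering can be estimated with an idealized cycle model. The sketch assumes that every block needs the same number of transfer cycles `t_io` and computation cycles `t_compute` and that the DMA transfer overlaps perfectly with the computation; the function names and the numbers are hypothetical.

```python
def cycles_single_buffer(n_blocks, t_io, t_compute):
    """Transfer and computation alternate on a single buffer."""
    return n_blocks * (t_io + t_compute)


def cycles_double_buffer(n_blocks, t_io, t_compute):
    """Block i+1 is transferred by the DMA controller while block i is processed."""
    if n_blocks == 0:
        return 0
    # The first transfer and the last computation cannot be overlapped.
    return t_io + (n_blocks - 1) * max(t_io, t_compute) + t_compute


single = cycles_single_buffer(4, 100, 300)
double = cycles_double_buffer(4, 100, 300)
print(single, double)
```

In this model the gain grows with the amount of data per iteration, which is consistent with the choice made for the EVD.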



5.3.2 ASIP ISA Definition 

The task of defining the instruction set architecture can be subdivided 
into first the definition of the processor architecture and afterwards 
the definition of the processor instructions (ISA). This distinction is 
common in the technical literature [132]. 

Due to significant mutual dependencies between the processor architecture 
and the supported instructions, this thesis advocates combining 
these definition tasks in one design step. The complexity of this task can 
be handled using a so-called iterative design technique [113] yielding 
a highly flexible and reusable instruction set^^. This iterative technique 
uses an initial architecture as a starting point which has to be either 

• defined from scratch by the designer (which obviously requires 
significant additional design time) 

• or taken out of a processor template library (PTL) of predefined 
ASIPs (which needs virtually no additional design time^^) 

The library-based approach is able to increase the design productivity 
and it has the additional merit of being able to provide predefined, ver- 
ified, and well-optimized software design tools (e.g. HLL compilers) 
and reference design descriptions (e.g. low-power HDL descriptions). 
These tools and descriptions can be directly used for non-critical ap- 
plications in order to obtain production quality results in a very short 
amount of time. 

The optimization of a given library-based ASIP in order to meet the con- 
straints of an application has to be based on the results of a worst case 

*®In contrast to so-called constructive techniques, which instantiate only the minimum needed amount of 
resources and instructions and, thus, provide a less flexible implementation. 

^^Apart from the design time which has to be spent on the design of library ASIP templates and tools 
by an EDA company. A similar concept is implemented in commercial logic synthesis tools like Synopsys’ 
DesignCompiler with the DesignWare library for word-level arithmetic and logic units. 
runtime analysis using the estimated low-level parameters after logic 
synthesis. As a function of the violated constraint(s), one or several of 
the techniques that have been proposed in Section 4.4 have to be applied 
in order to optimize the application. 



5.3.3 Software Implementation and Tools 

The implementation of the ASIP software requires programming tools 
like HLL compiler, assembler, linker as well as instruction set simulator. 
The implementation of these software design tools is a tedious 
and error-prone design task, because they have to be consistent with 
the ISA, coprocessor and interface definitions. In order to efficiently 
explore a large design space, the iteration time for the design loop in 
Figure 5.7 should be reasonably small. The design environment [109] 
that has been developed at the Institute for Integrated Signal Processing 
Systems, which is briefly reviewed in Chapter 6, automates the generation 
of these design tools. 

Provided that these tools are available for a certain ASIP architecture, 
the software design flow is straightforward and partially comparable 
to commercial software design flows for off-the-shelf processors. Differences 
to commercial environments are due to application-specific 
instruction set features, accelerators and specialized interfaces, which 
have to be supported by the programming tools. Furthermore, typical 
ASIPs are used in embedded systems-on-chip that require a combination 
of high computational performance together with a high energy-efficiency. 
These design goals can only be reached if the ASIP architecture 
and the ASIP software are jointly optimized. 

Optimization of runtime of the ASIP software has to take care to fully 
exploit the application-specific features of the ASIP, which have been 
implemented to match the performance critical parts of an application. 
These critical software parts may have to be iteratively hand-optimized 
to make efficient use of gradually more specialized hardware until the 
performance constraints are met. The remaining part of the software 
that is often less performance critical should nevertheless exhibit excellent 
code quality in order to avoid unnecessary deterioration in overall 
runtime. Typical optimizations of runtime include (but are not limited 
to): 



• avoiding redundant operations e.g. by constant propagation, com- 
mon subtree removal etc. 

• using dedicated instructions in order to speed up loop processing 
[159] 

• reducing memory accesses e.g. by optimally using the available 
registers, register pipelining [230] 

• implementing function calls by exchanging values in registers 
rather than using the (memory) stack 

• avoiding poor schedules by considering dynamic profile data (e.g. 
in the case of mutually exclusive “case” selections in C code, the 
selection with the highest probability should be evaluated first) 
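
The last item of this list can be illustrated with a small sketch. It assumes that the mutually exclusive cases are evaluated sequentially as a chain of compare-and-branch operations (not via a jump table); the case names and execution counts are hypothetical profiling data.

```python
def total_comparisons(case_order, exec_counts):
    """Comparisons executed if the cases are tested one after another in case_order."""
    return sum(exec_counts[case] * (position + 1)
               for position, case in enumerate(case_order))


# Hypothetical execution counts obtained from a profiling run.
exec_counts = {"idle": 10, "tracking": 70, "acquisition": 20}

source_order = ["idle", "tracking", "acquisition"]
profile_order = sorted(exec_counts, key=exec_counts.get, reverse=True)

before = total_comparisons(source_order, exec_counts)
after = total_comparisons(profile_order, exec_counts)
print(profile_order, before, after)
```

Testing the most frequent case first minimizes the total number of executed comparisons under this sequential model.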

Optimization of energy consumption is achieved by using optimized 
hardware architectures together with a software implementation that efficiently 
uses the hardware. In Section 4.4 the term intrinsic energy and 
the term overhead energy have been defined. Provided that the intrinsic 
energy of a task is significantly smaller than the overhead energy of a 
processor (which is a typical case for processors), any kind of software 
optimization that reduces the runtime (without increasing the overhead 
energy) also lowers the energy consumption of the processor. Software 
performance optimization thus corresponds to processor energy optimization 
for many typical scenarios, which fully agrees with the 
results of Tiwari [249]. 

Optimization of energy consumption can be generally achieved by e.g. 

• using specialized instructions that enable multiple operations per 
instruction 

• exploiting doze/sleep modes, which disable/switch off the clock 
distribution/generation 

• exploiting memory hierarchy e.g. by using small, low-power 
scratch pad memories [175] 
• replacement of power greedy operations like multiplications, divisions 
and modulo operations for constant R-values of a power 
of two with simpler shift and logic operations (this strength 
reduction typically results only in small benefits (if any) due to the 
high amount of overhead power associated with each instruction) 

• instruction selection based on the average energy consumed by an 
instruction pattern [252] [171] (with typically a very small benefit 
for the same reason as above) 

• using a coprocessor 

The above mentioned techniques can be partially integrated in the HLL 
compiler, but they are also useful for manual optimizations. 
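
The strength reduction mentioned in the list above can be sketched as follows for a constant operand of 2^k. The function names are hypothetical. Note that the equivalence for negative operands holds for Python integers (floor division and non-negative modulo); in C, division and modulo of negative values round differently, so a compiler has to treat that case separately.

```python
def mul_pow2(x, k):
    """x * 2**k implemented as a left shift."""
    return x << k


def div_pow2(x, k):
    """x // 2**k implemented as an arithmetic right shift."""
    return x >> k


def mod_pow2(x, k):
    """x % 2**k implemented as a bitwise AND with the mask 2**k - 1."""
    return x & ((1 << k) - 1)


# Exhaustive check on a small range of operands and shift amounts.
for x in range(-64, 65):
    for k in range(0, 6):
        assert mul_pow2(x, k) == x * 2**k
        assert div_pow2(x, k) == x // 2**k
        assert mod_pow2(x, k) == x % 2**k
print("shift/mask replacements agree with *, // and %")
```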



5.3.4 Hardware Implementation and Logic Synthesis 

Estimation of low-level hardware parameters like the maximum clock 
speed is needed in order to guarantee the feasibility of an ASIP archi- 
tecture w.r.t. the given application constraints as well as to track the 
effect of high level decisions during ASIP optimization. Thus, the designer 
has to maintain the consistency of yet another even more detailed 
description of the ASIP in the form of a HDL. 

In Section 5.7 a tool is presented which partially automates the
generation of this ASIP HDL description. This tool is able to generate
a detailed hardware description of the decoder from an abstract
operator-based instruction set implementation, which significantly
reduces the design time. Further work on hardware description
generation has been published in [218].

For a complete discussion of the critical factors concerning ASIP hard- 
ware implementations refer to Section 4.4. 

Example: The following examples illustrate the necessity of obtaining precise
estimates for the hardware implementation during ASIP optimization. Logic
synthesis of an initial 2-stage processor pipeline implementation for the DVB-T
A&T application indicated a maximum operating frequency that violated the
constraints. A complete redesign of the hardware implementation using a
3-stage pipeline solved this problem. Another issue was the implementation
of the general purpose register file, which initially produced an excessive
area consumption in combination with an unacceptable synthesis time. A
redesign of the register file using a more structural, hand-optimized HDL
description (cf. Appendix E.1) solved this problem and reduced the
combinational register file area by about 50%. Several times, a redesign of
optimized instructions using operator chaining was necessary in order to meet
the required operating frequency. There are many other examples of iterations
that were triggered by low-level constraint violations. This design
methodology is analogous to best-practice ASIC design flows [48], which
regularly reiterate logic synthesis for estimation purposes.



5.3.5 Implementation Profiling and Worst Case Runtime Analysis 

This task profiles the current HW/SW implementation considering the 
ASIP SW, the ASIP instruction set, the coprocessor and the interface 
behavior. The result of this implementation profiling is - in analogy 
to application profiling - the cycle count for entire tasks including I/O 
cycles, which has to be compared to the given constraints. After this 
step, the designer is aware of the critical tasks for the current imple- 
mentation. Similar to the methodology used for application profiling, 
the critical kernels in the application have to be identified as a prerequi- 
site for subsequent optimization. 

Profiling of an implementation has to determine the worst case cycle 
count in order to provide an upper bound, which has to be compared to 
the cycle constraints of the application. This worst case cycle count can 
be determined either by 

• simulation using stimuli, which yield this worst case condition 

• (either manual or tool supported) static analysis of the code, which 
is especially difficult in case of many control instructions and/or 
computed branches 

The principle of static cycle count analysis of assembler code is
illustrated in Figure 5.8, which depicts the assembler implementation
of a simple conditional if-instruction. The worst case cycle count for
the implementation is given by the longest path through the assembler
program, which is in this case

C_{if\_else,wc} = \max\{ C_1 + C_{bt} + C_{IF} + C_b,\; C_1 + \ldots \}    (5.1)

HLL implementation:

if (conditional_expression) {
    statement_I1;
    statement_I2;
    ...
    statement_In;
}
else {
    statement_E1;
    statement_E2;
    ...
    statement_En;
}

Assembler implementation (block labels): PROG_start, IF_block,
ELSE_block, CONT_block, where each annotation in the figure denotes a
cycle count.

Figure 5.8: Principle of Static Worst Case Cycle Count Analysis



In the case of uncorrelated forward branches this analysis is trivial.
In contrast, if (conditional) backward branches are present, e.g. in
order to implement HLL loop statements like while or for, this analysis
is more complicated. In these cases, the designer has to determine the
maximum possible number of loop iterations¹⁰, which has to be annotated
to the backward branch(es) for a worst case cycle analysis. For
correlations between forward branches, the analysis in Figure 5.8
yields a possibly pessimistic upper bound on the runtime. This issue is
covered in more depth by Boriello [32] and Li [169].

¹⁰ This maximum number of loop iterations can usually be derived from the HLL specification.

Example for the application of the two analysis methodologies: In case of the
DVB-T A&T application, simulation of typical operating scenarios has been
used in the first place in order to determine typical cycle counts. Moreover,
a subsequent static cycle count analysis of this highly branch-intensive
application has been manually performed to guarantee that the maximum cycle
constraints are met in any case. This static analysis guarantees that each
path through the assembler program is covered, which is difficult to ensure
with simulation. This task is typically less difficult for applications that
are more data flow oriented like the EVD.
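
The static analysis described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the procedure or tool used for the DVB-T A&T application: the program is modelled as a tree of basic blocks, if/else constructs and loops with an annotated iteration bound, branch cycles are assumed to be folded into the block cycle counts, and all class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Block:
    cycles: int                 # cycle count of a straight-line basic block


@dataclass
class IfElse:
    cond_cycles: int            # cycles for evaluating the condition
    then_body: List["Node"]
    else_body: List["Node"]


@dataclass
class Loop:
    max_iterations: int         # designer-supplied bound on the backward branch
    body: List["Node"]


Node = Union[Block, IfElse, Loop]


def worst_case_cycles(nodes):
    """Longest path through a structured program, in cycles."""
    total = 0
    for node in nodes:
        if isinstance(node, Block):
            total += node.cycles
        elif isinstance(node, IfElse):
            # Uncorrelated forward branches: take the longer alternative.
            total += node.cond_cycles + max(
                worst_case_cycles(node.then_body),
                worst_case_cycles(node.else_body),
            )
        elif isinstance(node, Loop):
            # Backward branch: multiply the body by the annotated bound.
            total += node.max_iterations * worst_case_cycles(node.body)
        else:
            raise TypeError(f"unknown node type: {node!r}")
    return total


program = [
    Block(3),
    IfElse(2, [Block(5)], [Block(9)]),
    Loop(10, [Block(4), IfElse(1, [Block(2)], [])]),
]
wcet = worst_case_cycles(program)   # 3 + (2 + 9) + 10 * (4 + 1 + 2)
```

Because every if/else is resolved independently with max, correlated branches lead to the pessimistic upper bound mentioned above.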

Example for profiling results: The profiling of an intermediate ASIP
implementation for the EVD application yields a worst case cycle count of
about 64 000 machine cycles. Figure 5.9 depicts the visualized assembler code
coverage for this intermediate implementation. The basic block with the
highest execution frequency contributes 31.6% to the overall runtime and is
denoted as Critical Block in Figure 5.9. This critical kernel will be
optimized in the next example in order to reduce the overall runtime.




Figure 5.9: Visualized EVD Assembler Coverage (Intermediate Impl.) 



5.3.6 Iterative ASIP Optimization 

This optimization task primarily focuses on enhancements of the
computational performance of a given ASIP architecture. However,
architectural optimization can also be used in order to lower the
energy consumption of an ASIP, which is demonstrated in Section 7.1.

According to Section 4.4.1 there are several options to increase
the computational performance for critical tasks:

• chaining of operators

• parallelization of operators

which are controlled by either

• a multi-issue architecture

• or specialized instructions in a single-issue architecture, each of
them with the ability to control several parallel operations with
just one instruction

Example (cont’d): A closer look at the critical computational loop in Figure 5.9 
reveals, that the data flow graph (DFG) depicted in Figure 5.10 is executed 
with each loop iteration. Each of the circles in Figure 5.10 denotes an op- 
eration on complex data. The load and store operations have already been 
optimized by implementing a dedicated address generation, which enables 
the access of row- and column-indexed matrix data. The row- and column 
values are updated by the zero-overhead loop control logic. The assembler 
implementation of this kernel for the considered single-issue ASIP architec- 
ture is given in Listing 5.1. This software implementation uses 12 assembler 
instructions in the loop kernel corresponding to 12 machine cycles per loop 
iteration. The instruction at the beginning of this loop (LPINI) is used to enable 
zero-overhead loop control for the innermost loop by incrementing a special 
register after each loop iteration. Afterwards, if the end value of this loop is not 
yet reached, a branch back to the loop start without a delay cycle is performed. 

For the DFG of Figure 5.10, optimized instructions are defined in the
following in order to speed up the loop processing. In this case, these
specialized instructions have to meet additional constraints of the
implementation:

• only one memory port is available (which enables only one memory read
or, alternatively, one memory write of a complex word per cycle)

• chaining of operators is impossible due to the maximum required oper- 
ating frequency 

• the number of operator instances and registers for temporary storage 
has to be minimized 

This task of scheduling loop operations can be solved with software
pipelining [155], e.g. using a technique like modulo scheduling [210]. For
this simple example an optimum schedule and register assignment can be
determined manually.

Figure 5.10: DFG of Critical Loop Kernel

.define EVRLOOP_CNTR R3

LPINI(EVRLOOP_START_LB, EVRLOOP_END_LB, EVRLOOP_CNTR, N_1, 3);
EVRLOOP_START_LB:

.undef N_1 R5
.define L_FREG FR4
.define R_FREG FR5

// load left and right columns
FRRLD(EV_M, EVRLOOP_CNTR, ODRLOOP_CNTR, N, L_FREG);
FRRLD(EV_M, EVRLOOP_CNTR, ODCLOOP_CNTR, N, R_FREG);

// calculate and update left EV column
.define TMP_LEFT_COL_FREG FR6
FMOV(L_FREG, 2, TMP_LEFT_COL_FREG, 2);
// TMP_LEFT_COL_FREG = ul*EV[r][piv_row]
FMUL(UL_FREG, TMP_LEFT_COL_FREG);
.define TMP2_FREG FR7
FMOV(R_FREG, 2, TMP2_FREG, 2);
FMUL(LL_FREG, TMP2_FREG);
// TMP_LEFT_COL_FREG += ll*EV[r][piv_col]
FADD(TMP2_FREG, TMP_LEFT_COL_FREG);
.undef TMP2_FREG FR7
FRRST(EV_M, EVRLOOP_CNTR, ODRLOOP_CNTR, N, TMP_LEFT_COL_FREG);
.undef TMP_LEFT_COL_FREG FR6

// calculate and update right EV column
FMUL(UR_FREG, L_FREG);
.undef R_FREG FR5
.define TMP_RIGHT_COL_FREG FR5
// TMP_RIGHT_COL_FREG = lr*EV[r][piv_col]
FMUL(LR_FREG, TMP_RIGHT_COL_FREG);
// TMP_RIGHT_COL_FREG += ur*EV[r][piv_row]
FADD(L_FREG, TMP_RIGHT_COL_FREG);
FRRST(EV_M, EVRLOOP_CNTR, ODCLOOP_CNTR, N, TMP_RIGHT_COL_FREG);
.undef TMP_RIGHT_COL_FREG FR5
.undef L_FREG FR4
EVRLOOP_END_LB:

Listing 5.1: Initial Assembler Loop Implementation of Critical Kernel



The fact that only one memory access per cycle is possible results in a lower
bound of 4 instructions for the loop body. Table 5.8 shows a possible schedule
for the operations M1 to M4, A1, A2, L1, L2, S1 and S2 of Figure 5.10 which
reaches this lower bound. The argument (n) refers to the processing of data
in DFG iteration n, whereas the argument (n+1) means that data for the next
DFG iteration are already processed.
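
The lower bound stated above can be reproduced with a small resource-bound calculation, sketched here in Python. The operation counts follow from the operations named in the text (M1 to M4, A1, A2, L1, L2, S1, S2). The assumption of exactly one multiplier and one adder is made for this sketch only and corresponds to the single Multiplier and Adder columns of Table 5.8; the function name is hypothetical.

```python
import math


def resource_lower_bound(op_counts, unit_counts):
    """Minimum number of cycles per loop iteration imposed by the resources.

    Each resource needs at least ceil(operations / available units) cycles;
    the most heavily loaded resource determines the bound.
    """
    return max(
        math.ceil(op_counts[resource] / unit_counts[resource])
        for resource in op_counts
    )


# Operations of one DFG iteration: M1..M4, A1..A2, and L1, L2, S1, S2.
ops = {"multiplier": 4, "adder": 2, "memory": 4}

# One memory port (stated constraint); one multiplier and one adder (assumed).
units = {"multiplier": 1, "adder": 1, "memory": 1}

bound = resource_lower_bound(ops, units)
```

With these numbers the memory port and the multiplier are both fully occupied in a 4-cycle loop body, so a shorter loop would require additional hardware resources.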



Cycle   Loop Instruction   Multiplier   Adder   Memory
0       PAR_INSN1          M3(n)        A1(n)   L1(n+1)
1       PAR_INSN2          M4(n)        -       S1(n)
2       PAR_INSN3          M1(n+1)      A2(n)   L2(n+1)
3       PAR_INSN4          M2(n+1)      -       S2(n)

Table 5.8: One Possible Loop Schedule for the Critical Loop Body

In analogy to the methodology used by HLL compilers [53], the edges in
Figure 5.10 correspond to virtual registers which reflect the lifetime of
these values in real registers. Table 5.9 depicts the lifetime of these
virtual registers for the data values which are associated with iteration
number n of the DFG. It can clearly be observed from Table 5.9 that the
processing for one DFG iteration is pipelined, with a latency of 8 cycles
and a throughput¹¹ of 4 cycles.



Loop Counter   Loop Cycle   VR1   VR2   VR3   VR4   VR5   VR6   VR7   VR8

Table 5.9: Lifetime of Virtual Registers for one DFG Iteration



These virtual registers have to be assigned to real registers. Table 5.10 shows 
a possible register assignment using 8 data registers. 



Register Nr.   0    1    2    3    4      5      6      7
Cycle 0        UL   LL   UR   LR   *VR1   *VR2   *VR3   *VR4
Cycle 1        UL   LL   UR   LR   VR1    VR2    VR5    VR6
Cycle 2        UL   LL   UR   LR   VR1    -      VR7    VR6
Cycle 3        UL   LL   UR   LR   VR1    VR2    VR3    VR8

Table 5.10: Register Allocation for the Critical Loop Body



Note that the virtual registers VR1 to VR4 that have a '*'-prefix in Table 5.10
are not produced in the current but in the previous instruction loop
iteration¹². This implies that for the first loop iteration these values have
to be precalculated and moved into the required real registers, which
corresponds to the pipeline prologue for software pipelined VLIW machines [7].
Also note that after the last DFG iteration two loads L1(n+1) and L2(n+1) as
well as two multiplications M1(n+1) and M2(n+1) are superfluous, because they
belong to the non-existent next iteration of the DFG. Alternatively, if this
behavior of executing superfluous loads and multiplications cannot be
tolerated, a loop epilogue has to be implemented and the loop end count value
for the optimized loop has to be decremented.

¹¹ In this case, the throughput is the important aspect that has to be optimized in order to reduce the
runtime. The result of 4 cycles per iteration has been obtained by neglecting the overhead due to prologue
and epilogue, as well as the overhead for the loop initialization. This overhead is certainly small, because
the number of iterations is large.

¹² Here, the term instruction loop iteration refers to one iteration of the loop that uses the optimized
instructions.

Tables 5.8 and 5.10 implicitly define the functionality of the new optimized
loop instructions PAR_INSN1 to PAR_INSN4. This new functionality is described
more clearly in Table 5.11. Note that in the unoptimized implementation, the
FRRST/FRRLD instructions use the loop counter in order to calculate the
effective memory address according to Figure 5.10. Due to the fact that the
optimized implementation uses pipelined processing with a latency that is
larger than the number of loop instructions, the FRRST instructions need to
calculate the effective address using a decremented value of the loop
counter¹³. This fact is considered in Table 5.11 by the notation adr(CNT) and
adr(CNT-1). Obviously, the instructions in Table 5.11 would have to use many
operand fields in order to implement the same functionality as the original
instructions. This would lead to an unacceptable instruction coding width. In
this case, these operands can be omitted by using optimized hardwired control
logic, because the reusability of these instructions is limited and these
optimized instructions do not really need the flexibility of programmable
operands.



Optimized      replaces the following
Instruction    simpler instructions:

PAR_INSN0      FMUL FR2, FR4, FR7 || FADD FR6, FR7, FR6 || FRRLD (adr(CNT)), FR4
PAR_INSN1      FMUL FR3, FR5, FR6 || FRRST FR5, (adr(CNT-1))
PAR_INSN2      FMUL FR0, FR4, FR6 || FADD FR6, FR7, FR7 || FRRLD (adr(CNT)), FR5
PAR_INSN3      FMUL FR1, FR5, FR7 || FRRST FR7, (adr(CNT-1))

Table 5.11: Functionality of the Optimized Instructions



¹³ This decremented value of the loop counter corresponds to the value of the loop counter in the previous
loop iteration of the optimized implementation.

With these optimized instructions, the enhanced software implementation of
the loop is given in Listing 5.2 together with the loop prologue. This
optimization has reduced the cycle count for this critical loop by 55.7%,
which translates into a cycle reduction of 17.6% for the overall EVD task
according to Table 5.12. The moderate reduction in overall cycle count is due
to Amdahl's law [9]: The critical loop which has been optimized contributes
only 31.6% and not 100% of the total runtime. If further cycle count
reduction is needed, the above-described concept has to be applied to
different critical blocks in the EVD task. Alternatively, the constraints that
have been used for the above-mentioned example optimization have to be relaxed
by increasing the hardware effort for the implementation. In any case, the
feasibility of the optimized implementation has to be checked by adding the
optimized instructions to the hardware description of the ASIP and by logic
synthesis. Although the example optimizations have not used operator chaining
in the data path, there is the risk of getting an increased critical path due
to a higher number of area-intensive read and write ports of the general
purpose register file (see Appendix E.1 for synthesis results).
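
The relation between the loop speed-up and the overall cycle reduction quoted above can be checked with Amdahl's law. The following Python sketch uses only the two percentages from the text (a runtime share of 31.6% for the critical loop and a normalized loop cycle count of 44.3%); the function name is hypothetical.

```python
def normalized_runtime(kernel_fraction, normalized_kernel_runtime):
    """Amdahl's law in normalized form.

    kernel_fraction:            share of the original runtime spent in the kernel
    normalized_kernel_runtime:  optimized kernel runtime / original kernel runtime
    """
    return (1.0 - kernel_fraction) + kernel_fraction * normalized_kernel_runtime


# Critical loop: 31.6% of the runtime, reduced to 44.3% of its cycle count.
overall = normalized_runtime(0.316, 0.443)

# Limit for an infinitely fast kernel: the remaining code bounds the gain.
best_case = normalized_runtime(0.316, 0.0)
```

The value of overall is about 0.824, which matches the normalized overall cycle count of 82.4%. Even an infinitely fast loop kernel would leave 68.4% of the original runtime, which is why further reductions require optimizing other blocks.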



.define EVRLOOP_CNTR R3

// loop prologue
MOVI(0, EVRLOOP_CNTR);                             // load values for 0-th iteration
FRRLD(EV_M, EVRLOOP_CNTR, ODRLOOP_CNTR, N, FR4);   // L1(0)
FRRLD(EV_M, EVRLOOP_CNTR, ODCLOOP_CNTR, N, FR5);   // L2(0)
MOVI(1, EVRLOOP_CNTR);                             // start loop with iteration 1
FMOV FR4, FR6;                                     // copy
FMUL FR0, FR6;                                     // M1 operation
FMOV FR5, FR7;                                     // copy
FMUL FR1, FR7;                                     // M2 operation

// start of loop
LPINI(EVRLOOP_START_LB, EVRLOOP_END_LB, EVRLOOP_CNTR, N_1, 3);
EVRLOOP_START_LB:
PAR_INSN1;
PAR_INSN2;
PAR_INSN3;
PAR_INSN4;
EVRLOOP_END_LB:
// here: no epilogue

Listing 5.2: Enhanced Loop Implementation Using New Instructions





                            Unoptimized       Optimized
                            Implementation    Implementation
Loop Cycle Count            17685             7830
Norm. Loop Cycle Count      100%              44.3%
Overall Cycle Count         63680             52475
Norm. Overall Cycle Count   100%              82.4%

Table 5.12: Cycle Count Reduction for Critical Loop



5.3.7 Definition of a tightly coupled ASIP Accelerator

The above-mentioned ISA optimization offers only limited scalability
for applications that require a very large number of parallel
operations. This is due to bottlenecks in the ASIP architecture caused
by area-intensive general purpose registers or centralized data
memories, each of them with a limited number of read and write ports.
Furthermore, there are also applications with a significantly larger
number of operations in the critical loop bodies, which results in a
large number of optimized instructions. This in turn requires more
complex decoder structures, which reduce the overall efficiency of the
implementation and can lead to clock constraint violations.

For the case that an ISA optimization fails to deliver the needed compu- 
tational performance or energy-efficiency, a more dedicated ASIP co- 
processor can be implemented. This tightly coupled ASIP accelerator 
can be viewed as a computationally powerful functional unit that sup- 
ports either pipelined or unpipelined computations. Typically, the la- 
tency of such an accelerator is significantly larger than the latency of 
1 or 2 cycles of ordinary functional units like adders and multipliers. 
For this reason, additional control mechanisms are needed in order to 
start the accelerator and to synchronize the accelerator results with the 
program flow of the ASIP. 

Example: The EVD of an NxN hermitian matrix needs O(N^2) CORDIC evaluations
using the angular and the rotate mode. For a 10x10 matrix this corresponds
to at least 270 angle calculations and 270 vector rotations for the desired
precision. An optimized software implementation of the two CORDIC modes
results in about 118000 cycles for the 10x10 EVD. By using an ASIP
accelerator, this cycle count can be reduced by more than 50% to about
50000 cycles.

An important aspect for this decision is the required flexibility of a task, be- 
cause a dedicated accelerator architecture is much less flexible and pro- 
grammable than a pure software implementation. In case of the CORDIC 
subtasks for the EVD, the required flexibility is sufficiently low due to the fact 
that the CORDIC algorithm is a well-tested algorithm, which is not prone to 
late design changes and design errors. 

The CORDIC algorithm (see Appendix B.1 for implementation details) uses
iterative conditional additions and subtractions in order to cancel either the
angle z (rotate mode) or the ordinate y (vectoring mode) of a two-dimensional
vector. The control data flow graph of these algorithms can be extracted and
mapped to dedicated hardware. In case of the EVD, the hardware structure
that is depicted in Figure 5.11 has been implemented, which supports both
the vectoring and the rotate mode of the CORDIC.
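
The two CORDIC modes can be illustrated with the textbook form of the algorithm. The following Python sketch is not the implementation of Appendix B.1 or of the accelerator: it uses floating-point arithmetic instead of fixed-point shifts and adds, assumes that the input vector already lies in the right half-plane (x >= 0), and assumes a rotation angle within the convergence range of the iteration. The function name and the mode strings are hypothetical.

```python
import math


def cordic(x, y, z, mode, iterations=32):
    """Textbook CORDIC iteration.

    mode "rotate":    drives the angle z towards 0 and rotates (x, y) by z.
    mode "vectoring": drives the ordinate y towards 0 and accumulates the
                      angle of (x, y) in z.
    The constant CORDIC gain is compensated at the end.
    """
    if mode not in ("rotate", "vectoring"):
        raise ValueError("mode must be 'rotate' or 'vectoring'")
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    for i in range(iterations):
        if mode == "rotate":
            d = 1.0 if z >= 0.0 else -1.0    # cancel the angle z
        else:
            d = -1.0 if y >= 0.0 else 1.0    # cancel the ordinate y
        shift = 2.0 ** (-i)                  # a right shift in hardware
        x, y, z = x - d * y * shift, y + d * x * shift, z - d * math.atan(shift)
    return x / gain, y / gain, z


# Rotate mode: rotate the unit vector (1, 0) by 0.5 rad.
cos_val, sin_val, _ = cordic(1.0, 0.0, 0.5, "rotate")

# Vectoring mode: magnitude and angle of the vector (3, 4).
magnitude, residual_y, angle = cordic(3.0, 4.0, 0.0, "vectoring")
```

In each iteration only a conditional addition or subtraction per component is needed once the multiplication by a power of two is realized as a shift, which is what makes the algorithm attractive for a dedicated hardware structure.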



[Figure: CORDIC accelerator with the inputs x_in, y_in and z_in, a
preprocessing stage that maps the input to the right half-plane and sets a
flag if a modification was needed, and the output x_out]

Figure 5.11: Structure of the CORDIC Accelerator



5.4 Verification

Correctness of a chip implementation prior to chip fabrication is of
paramount importance due to the high prototype costs of fabrication.
According to [121] the prototype costs for relatively small chips
between 20 sq. mm and 36 sq. mm are in the range between 600k and
900k US$. This is mostly due to the expensive production of mask sets
and due to the low initial volumes for these prototypes. As a
consequence, the risk of an implementation error should be minimized.
This has been a matter of course for the design of dedicated hardware
for many years, but it is also valid for embedded software on a chip:
In the case of the DVB-T receiver described in Section 7.1 the ASIP
software is stored in an on-chip ROM. In case of a software
malfunction, the software in the ROM needs to be modified, which
requires restarting the fabrication process of the chip (at slightly
reduced costs) using a different mask set for the ROM information. Of
course, the program information could have been stored in an internal
RAM, but this would have increased the power consumption and silicon
area of the implementation. Moreover, this would have required a more
complicated bootload process for the chip, which is supposed to operate
as an easy-to-use stand-alone solution.

The term verification is often confused with the term testing.
However, from a hardware perspective, testing only refers to
post-silicon fabrication tests which guard against faults in the
physical fabrication process. Verification rather means the process of
checking whether all the design constraints (refer to Subsection 3.1.1)
are met. This includes checking the behavioral equivalence of two
descriptions on different levels of abstraction as well as verifying
the algorithmic correctness and computational performance of an
implementation.

Verification of the ASIP hard- and software w.r.t. a behavioral reference 
and additional time and interface constraints can be subdivided into the 
following three subtasks: 



• Verification of the ASIP software and the ASIP instruction set
simulator (ISS): Behavioral equivalence check of the ASIP SW running
on an ASIP ISS for all possible input patterns. This verification
task has to use a cycle-true instruction set simulator in order to
determine the cycle count of the implementation, which has to meet
the cycle constraints of the application¹⁴.

After successful completion of this task, the instance of the
cycle-true ISS that has been used for this verification has also been
qualified as a golden reference for the following verification of the
ASIP hardware

• Verification of the ASIP hardware (processor core): Equivalence
check between the behavior of the ASIP instruction set simulator
(which is used as a golden reference) and the ASIP hardware
implementation

• Verification of the ASIP hardware in the system environment
(interfaces): A check whether the ASIP hardware interfaces comply with
the specification. For this step, a model of the ASIP environment is
needed, which allows the ASIP hard- and software to be integrated in
the system environment

Theoretically, the ASIP software can be verified together with the ASIP 
hardware, using a cosimulation between the ASIP simulated using an 
HDL simulator and the reference software implementation. However, 
due to the slow simulation speed of HDL simulators, this approach leads 
to excessive simulation runtime, which is prohibitive, if many modes of 
operations have to be simulated. The proposed approach separates the 
ASIP hardware verification and the ASIP software verification, which 
significantly reduces the total simulation effort. Furthermore, with this 
methodology it is guaranteed, that the ASIP hardware fully corresponds 
to the ASIP specification, which enables the designer to apply late (pos- 
sibly post-silicon) design changes to the ASIP software, without having 
to worry about errors in the ASIP hardware. 

¹⁴ This verification step has to be performed in combination with a worst case cycle analysis of the final
implementation according to Subsection 5.3.5.

The ASIP software verification process can (theoretically) be solved
by exhaustive cosimulation between the behavioral reference
implementation and the ASIP software running on a fast ASIP
instruction set simulator. For complex applications with a large number
of different operation modes and a large number of internal program
states this simulation leads to prohibitively long simulation runtimes.
In such a case, a divide and conquer approach can be used in order to
verify small, independent parts of the application code¹⁵. An
alternative would be to use a formal description as a reference
implementation like in [198], but this would necessitate additional
design effort (and possibly introduce additional mistakes in the formal
specification), which is prohibitive for a fast time to market. The use
of a thoroughly verified high-level language compiler can significantly
accelerate this verification step, because this corresponds to a
correct-by-construction design approach. However, even in this case
extensive functional simulations are needed in order to verify the
correct behavior of the instruction set simulator, which is supposed to
be used as a reference for the next verification task.

In this thesis the simulation-based verification approach is advocated,
because this approach does not introduce the overhead of additional
descriptions. Furthermore, the stimuli for application profiling can be
reused as a basis to obtain a good application code coverage. It is
also worth mentioning that a library-based ASIP design approach
significantly facilitates this design task, because the designer only
needs to verify the application-specific parts of the program and the
application-specific modifications of the ISS rather than the complete
program¹⁶ and the complete ISS.

Example: The verification of the software and the instruction set simulator
for the DVB-T A&T application has been achieved using exhaustive simulation of
all operating modes together with manually generated stimuli for data inputs
in order to reach all relevant internal states. The instruction coverage of
the software has been verified with a coverage tool, whereas the internal
state coverage has been manually verified using generated histograms. In
addition to these simulation scenarios, pseudo-random data inputs have been
used in order to increase the level of confidence in the implementation. This
verification task has required about 28% of the total design time.
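
The combination of directed and pseudo-random stimuli in a simulation-based equivalence check can be sketched as follows. This Python example is purely illustrative and does not reproduce the DVB-T A&T test bench: a saturating 16-bit addition stands in for the behavioral reference, a bit-true model with wrap-around and overflow detection stands in for the code running on the instruction set simulator, and all function names as well as the seed value are hypothetical.

```python
import random

INT16_MIN, INT16_MAX = -32768, 32767


def reference_sat_add(a, b):
    """Behavioral reference: saturating 16-bit addition."""
    return max(INT16_MIN, min(INT16_MAX, a + b))


def iss_sat_add(a, b):
    """Bit-true model: 16-bit wrap-around addition plus overflow detection."""
    s = (a + b) & 0xFFFF
    if s & 0x8000:                  # reinterpret as signed 16-bit value
        s -= 0x10000
    same_sign_inputs = (a >= 0) == (b >= 0)
    if same_sign_inputs and (s >= 0) != (a >= 0):
        return INT16_MAX if a >= 0 else INT16_MIN
    return s


def cosimulate(stimuli):
    """Run both models on all stimuli and collect the mismatching inputs."""
    return [
        (a, b) for a, b in stimuli
        if reference_sat_add(a, b) != iss_sat_add(a, b)
    ]


# Directed stimuli: corner values chosen to reach the saturation cases.
corner_values = [INT16_MIN, -1, 0, 1, INT16_MAX]
directed = [(a, b) for a in corner_values for b in corner_values]

# Pseudo-random stimuli with a fixed seed, so that failures are reproducible.
rng = random.Random(2002)
pseudo_random = [
    (rng.randint(INT16_MIN, INT16_MAX), rng.randint(INT16_MIN, INT16_MAX))
    for _ in range(10000)
]

mismatches = cosimulate(directed + pseudo_random)
```

An empty mismatch list does not prove equivalence; it only increases the level of confidence, which is why coverage metrics are tracked in addition.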

¹⁵ In some cases, the ASIP software has to be modified in order to support this approach.

¹⁶ Provided that the optimizing compiler for the base architecture is 100% error free.

The hardware verification of the ASIP processor core is needed to
check for implementation errors in the ASIP hardware implementation
w.r.t. the reference instruction set simulator. The purpose of test
programs in this context is to stimulate the hardware description of
the processor in order to achieve a certain coverage goal for
implementation errors. Due to the high abstraction level of an
RTL-based hardware description, this verification task is much simpler
and needs far fewer stimuli than test vector generation for
post-fabrication chip testing. On the RT-level of abstraction, the
functionalities of operators like adders, shifters and multipliers are
correct by definition (because they are synthesized using automatic
logic synthesis¹⁷) and do not need to be verified exhaustively. However,
the scheduling and the interconnection of all these RTL operators and
all the storage units as well as the functionality of all implicitly
and explicitly described finite state machines have to be verified.
Furthermore, manually designed optimized operators have to be
exhaustively¹⁸ verified.

The metric that implicitly covers parts of this error model is the code
line coverage of a simulation. However, unlike the case of software
verification, a full coverage of the HDL code for hardware is only a
minimum requirement for verification. There are additional requirements
like

a) toggle coverage: each binary node in the RTL description has to
switch from 0 to 1 or vice versa at least once during simulation
- this metric can also be extended to groups of nodes, which are
required to switch to any (possible) binary combination (this metric
also includes the state coverage of finite state machines)

b) for finite state machines the so-called state, transition and limited
path coverage has to verify the possible state transitions and check
whether don't-care inputs can trigger wrong state transitions

c) functional coverage exercises a set of error-prone execution
scenarios in order to check for critical events like
pipeline interlocking, data forwarding or interrupts
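
The toggle coverage metric of item a) can be made concrete with a short sketch. The following Python function is an illustration only and assumes that the simulator can dump, for every binary node, the sequence of sampled values; the trace format, the node names and the function name are hypothetical.

```python
def toggle_coverage(trace):
    """Compute the toggle coverage of a simulation trace.

    trace maps a node name to the list of 0/1 values sampled for this node,
    one value per simulation step. A node counts as covered if it switches
    from 0 to 1 or vice versa at least once.
    Returns the coverage ratio and the sorted list of nodes that never toggled.
    """
    toggled = {
        name
        for name, values in trace.items()
        if any(a != b for a, b in zip(values, values[1:]))
    }
    untoggled = sorted(set(trace) - toggled)
    return len(toggled) / len(trace), untoggled


example_trace = {
    "alu_carry":   [0, 0, 1, 0],
    "irq_pending": [0, 0, 0, 0],   # never stimulated by the test programs
    "stall":       [1, 1, 0, 0],
}
coverage, untoggled_nodes = toggle_coverage(example_trace)
```

In this example two of three nodes toggle, and the untoggled node points to a scenario (here an interrupt) that the test programs still have to exercise.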

Furthermore, the stimuli and the observation points for the
cosimulation have to be chosen in order to maximize the effective
observability-based statement coverage, which was first defined in
[65]. This means, for instance, that test vectors which suppress the
propagation of internal incorrect values to the observation points
should be avoided. A good overview of verification issues is given in
[241].

¹⁷ Implementation errors of the synthesis tool itself are obviously possible, but they cannot be verified at
this level of abstraction. In this case it is rather necessary to take advantage of formal verification between
the RTL description and the synthesized netlist of a design by using commercial equivalence checkers like
Synopsys' Formality [237] or CVE [33], which is an in-house tool of Infineon Tech. AG.

¹⁸ For many manually designed operators, (nearly) exhaustive verification is in fact possible, due to the
low complexity of these operators.

The approach that is advocated in this thesis for the RTL design
verification is based on the work that the author has published in
[87]. Starting from an instruction set architecture definition and a
set of user-defined rules, constraints, and test biases, a test case
generator (TCG) is used to generate the test programs that satisfy the
above-mentioned constraints. The selected approach enables the support
of significantly different architectures¹⁹, because the user has direct
control over the test case generation process. A detailed description
of the TCG tool is given in Subsection 6.3.2.

Example: For the DVB-T A&T application automatic test program generation 
has been used in order to verify the behavior of the HDL description for bound- 
ary conditions. Furthermore, functional cosimulation using a subset of the al- 
ready available stimuli for software verification has been performed in order to 
simulate the typical behavior of the implementation. These test programs and 
test stimuli have been added to a regression test suite to verify the functional- 
ity of the implementation after design changes. 

It has to be emphasized that the DVB-T A&T implementation has been de- 
signed in order to ease verification. This has been achieved by a largely 
orthogonal base implementation complemented by application-specific func- 
tional units. The orthogonal base architecture and the application-specific 
functional units have been verified separately. Unorthogonal features like 
multi-cycle, multi-word instructions, and complicated internal state machines 
have been avoided. This significantly eases the debugging process for the 
hardware implementation during cosimulation with the bit- and cycle-true
instruction set simulator. This hardware verification task has required
about 11% of the total design time.

The last verification task, the verification of the ASIP hardware
interfaces, is needed to check whether the ASIP interfaces comply with
the constraints of the system environment. The following aspects of
these interfaces have to be verified:

• the interface protocol constraints 

• the low level timing constraints of the final synthesized, placed and 
routed design 

• correct interconnections 

^^This methodology has been successfully applied to a TMS320C25 DSP clone in [87] (accumulator based 
architecture) and to the DVB-T A&T processor of the case study in Section 7.1 (load/store architecture). 
The low level timing constraints have to be checked after synthesis of
the complete system and/or after place and route. The interconnections
between the ASIP and the system environment as well as the correct
implementation of the interface protocols require a simulation of the
ASIP hardware description (running the ASIP application software) in
a model of the system environment. This model of the system can either
be a monolithic RTL-based hardware description or a heterogeneous set
of behavioral or RTL-based hardware blocks, which can be simulated
within a commercial system simulation environment like e.g. Synopsys’
CoCentric System Studio [238] or Cadence’s VCC [39].

Example: For the simulation of the complete DVB-T system including all digital 
parts of the DVB-T receiver, RTL-VHDL simulation on a high end workstation 
with multiple parallel CPUs has been used. This verification task has required 
about 6% of the total design time. 

For even more complex systems in the future, however, this methodology 
might become difficult due to excessive simulation runtime. Models that use a 
higher level of abstraction and enable faster simulation might be a solution for 
this issue. 



5.5 Concluding Remarks 

This chapter has introduced the proposed ASIP design flow of this the- 
sis. In contrast to previous ASIP design approaches, the ASIP hard- 
and software is in the main design iteration loop of the proposed design 
methodology. This implies that many of the tedious ASIP design tasks 
have to be automated to a large extent in order to obtain a short time-to- 
market. The next chapter briefly describes the LISA tool suite, which 
is able to meet this requirement. A special focus of the next chapter is
on new concepts for hardware generation and verification that have been
triggered by this thesis.




Chapter 6 



The ASIP Design Environment 



This chapter starts by giving an overview of the LISA^ processor
description language and the tools that can be generated by the LISA
design environment^. The focus of this chapter is on the concepts and
tools for ASIP-specific extensions to this former design environment
that have been developed in this thesis. The features of the latest LISA
tool suite (status of October, 2003) are summarized in Appendix A.



6.1 The LISA Language 

The LISA language is based on two different language constructs, 
namely 

• resources, which declare storage units like data and address regis- 
ters, pipeline registers etc. as well as the memory organization 

• operations, which define the instruction syntax, the instruction 
coding and the state transitions that are performed by instruction 
execution 

The RESOURCE section is a straightforward description of the proces- 
sor resources, using a syntax which resembles the definitions of vari- 
ables in the high level language C. Listing 6.1 shows an excerpt of a 
RESOURCE section for the scalar part of the ICORE-II architecture 
described in Section 7.2. 

Listing 6.1 demonstrates the usage of data types for fixed point
arithmetic with arbitrary bit width (the bit data type), which is one
important feature of LISA for ASIP design. The bit width is one key parameter

^Language for Instruction Set Architecture Description [286] 

^This summary refers to the LISA tools as available, when this thesis was started. In the meantime, 
many enhancements proposed by this thesis have been implemented in the production version of the LISA 
tools. 
RESOURCE // EVD-PAST-Processor
{
  MEMORY_MAP
  {
    0x0000 -> 0x0200, BYTES(3) : prog_mem[0x0000..NUM_PROGMEM_WORDS-1],
                                 BYTES(3);
    0x0000 -> 0x0100, BYTES(4) : data_mem_r[0x0000..NUM_DATAMEM_WORDS-1],
                                 BYTES(4);
    0x0000 -> 0x0100, BYTES(4) : data_mem_i[0x0000..NUM_DATAMEM_WORDS-1],
                                 BYTES(4);
  }

  PROGRAM_MEMORY long prog_mem[0x0000..NUM_PROGMEM_WORDS-1];
  DATA_MEMORY signed bit[MEM_WL] data_mem_r[0x0000..NUM_DATAMEM_WORDS-1];
  DATA_MEMORY signed bit[MEM_WL] data_mem_i[0x0000..NUM_DATAMEM_WORDS-1];

  PROGRAM_COUNTER unsigned int PC;   // normal PC
  REGISTER unsigned int BPC;         // PC for Branch Processing
  REGISTER unsigned int OPC;
  REGISTER bool BPC_valid;

  REGISTER signed bit[DP_WL] FR_r[0..NUM_FREGISTERS-1];
  REGISTER signed bit[DP_WL] FR_i[0..NUM_FREGISTERS-1];
  REGISTER signed bit[ld_NUM_MEM_WORDS] R[0..NUM_IREGISTERS-1];

  PIPELINE pipe = { FE; DE; EX };
  PIPELINE_REGISTER IN pipe {
    long instr1, instr2, instr3, instr4; /* 24 bit words */
    int pc;
  };

  long cycle, instruction_counter;

  /* zero overhead loop support */
  int ZOLP_active[0..NUM_NESTED_ZOLP-1];
  int ZOLP_start_addr[0..NUM_NESTED_ZOLP-1];
  int ZOLP_end_addr[0..NUM_NESTED_ZOLP-1];
  int ZOLP_R_end_value[0..NUM_NESTED_ZOLP-1];
  int ZOLP_increment_flag[0..NUM_NESTED_ZOLP-1];
}

Listing 6.1: Example RESOURCE Section

that has to be tailored to an application in order to reduce the hardware
overhead. Furthermore, the RESOURCE section supports constants,
which can be included from a standard C header file in order to
parameterize the implementation. This helps the designer to keep the
description consistent, which has proved to be useful in hardware
description languages like VHDL (the GENERIC parameters) and Verilog
(the parameter values). In case of multiple operations per pipeline
stage the keyword REGISTER used together with a compiler flag enables
cycle accurate behavior of clocked resources. This feature is needed in
order to obtain an unambiguous result regardless of the order of executed
operations^.
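
The effect of the clocked REGISTER behavior can be illustrated with a small Python sketch. It is not LISA code and makes no claim about the internals of the LISA simulator; the class ClockedRegister and the function run_stage are hypothetical names introduced only for this illustration. A writer and a reader operation of the same pipe stage are executed in both possible orders, once with immediate and once with clocked updates.

```python
class ClockedRegister:
    """Hypothetical model of a clocked resource (not part of the LISA tools)."""

    def __init__(self, value=0):
        self.current = value  # value seen by every reader during this cycle
        self.pending = value  # value that will be latched at the clock edge

    def read(self):
        return self.current

    def write(self, value, clocked=True):
        self.pending = value
        if not clocked:
            self.current = value  # unclocked: the update is visible immediately

    def clock(self):
        self.current = self.pending  # clock edge: latch the pending value


def run_stage(order, clocked):
    """Run a writer and a reader operation of one pipe stage in the given order.

    Returns the flag value that the reader observed in this cycle.
    """
    flag = ClockedRegister(0)
    observed = []
    operations = {
        "writer": lambda: flag.write(1, clocked=clocked),
        "reader": lambda: observed.append(flag.read()),
    }
    for name in order:
        operations[name]()
    flag.clock()  # end of the cycle
    return observed[0]
```

With clocked=True the reader observes the old value 0 for both execution orders, whereas with clocked=False the observed value depends on whether the writer happened to run first.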

The OPERATION section in LISA is used to describe the state tran- 
sitions of the resources by specifying the properties of the processor 
instructions. For this purpose, the OPERATION section has several sub- 
sections: 

• a DECLARATION section, which declares instances or groups in 
order to construct a complete tree of operations for one processor 
instruction 

• a CODING section, which defines the coding for the actual oper- 
ation resulting in a coding tree for the complete instruction set of 
the processor 

• a SYNTAX section, which defines the assembler syntax of the ac- 
tual operation resulting in the complete syntax definition for the 
instruction set 

• a BEHAVIOR section, which defines the state transitions per- 
formed by an operation 

• an ACTIVATION section, which triggers the behavior of other op- 
erations 

• an EXPRESSION section, which returns values to parent^ operations

• a SEMANTIC section, which will be used for future compiler sup- 
port in order to provide additional information to a future compiler 
generator 

^Without this flag and under certain conditions, LISA is prone to race conditions, which result in
incorrect values of clocked resources. A simple example is a flag that is modified by one operation in a
pipe stage and that is read by another operation in the same pipe stage. Without the clocked behavior, the
updated (and wrong) value is read (if the modifying operation has been executed first), whereas in case of
clocked behavior the old registered (correct) value is read in any case. This behavior is analogous to race
conditions in Verilog, which have to be avoided by the designer.

^The parent of an operation refers to the tree which represents the hierarchy of operations. This tree can 
be seen as a decomposition of an instruction into smaller parts e.g. register fields associated to register read 
operations, memory fields associated to memory accesses etc. 
Listing 6.2 shows an extract of the description of a coding subtree,
which is used to describe the branch instructions of the processor in
Section 7.2. In this listing the operation insn_branch is the parent of
the operations insn_branch_cond and insn_branch_uncond.



OPERATION insn_branch IN pipe.EX
{
  DECLARE
  { GROUP insn = { insn_branch_cond || insn_branch_uncond };
    GROUP adr = { address_op }; }
  CODING { 0b00 0bx[1] insn adr }
  SYNTAX { insn adr ")" }
  BEHAVIOR
  {
    adr();   // perform address calculation before instr. execution!
    insn();  // (addresses are temporarily stored in global variables)
  }
}

OPERATION insn_branch_cond IN pipe.EX
{
  DECLARE
  { GROUP insn = { FBZ || FBARGE }; }  // cond. branch
                                       // if register zero (FBZ)
                                       // if abs (real (register1) >=register2
  CODING { 0b1 0bx[3] insn }
  SYNTAX { insn }
  BEHAVIOR { insn(); }
}

OPERATION insn_branch_uncond IN pipe.EX
{
  DECLARE
  { GROUP insn = { B || LPINI }; }  // uncond. branch/init. zero ovhd. loop
  CODING { 0b0 0bx[3] insn }
  SYNTAX { insn }
  BEHAVIOR { insn(); }
}

OPERATION FBZ IN pipe.EX
{
  DECLARE
  { INSTANCE freg; }
  CODING { 0b0 freg 0bx[3] }
  SYNTAX { "FBZ" "(" freg }
  BEHAVIOR { if ( (FR_r[freg].ExtractToLong(0, LONGBITS)==0) &&
                  (FR_r[freg].ExtractToLong(0, LONGBITS)==0) ) {
      BPC_valid = 1; BPC = address_tmp_var;
      PIPELINE_REGISTER(pipe, FE/DE).flush();
      PIPELINE_REGISTER(pipe, DE/EX).flush(); }
  }
}

Listing 6.2: Example Coding Tree Description for Branch Instructions
In Listing 6.3 the coding root for the same processor is described, which
supports 24/48 bit instruction word widths. The root of the coding tree is
defined by the keywords CODING AT. The SWITCH statement decides,
as a function of one bit in the coding, whether a one or a two word
instruction is selected (insn_1word or insn_2word).



OPERATION decode IN pipe.DE
{ DECLARE { ENUM InsnType = { type_1word, type_2word };
            GROUP Insn_grp_1word = { insn_1word };
            GROUP Insn_grp_2word = { insn_2word }; }
  SWITCH (InsnType)
  {
    CASE type_1word:
    {
      CODING AT (OPC) {
        PIPELINE_REGISTER(pipe, FE/DE).instr1 == Insn_grp_1word }
      SYNTAX { Insn_grp_1word }
      ACTIVATION { Insn_grp_1word }
    }
    CASE type_2word:
    {
      CODING AT (OPC) {
        (PIPELINE_REGISTER(pipe, FE/DE).instr1 == Insn_grp_2word=[24..47])
        && (PIPELINE_REGISTER(pipe, FE/DE).instr2 == Insn_grp_2word=[0..23]) }
      SYNTAX { Insn_grp_2word }
      ACTIVATION { Insn_grp_2word }
    }
  }
}

OPERATION insn_1word IN pipe.DE
{
  DECLARE
  { GROUP insn = { insn_branch || insn_one_cycle ||
                   insn_multi_cycle ... }; }
  CODING { 0b0 insn }
}

OPERATION insn_2word IN pipe.DE
{
  DECLARE
  { GROUP insn = { FSI_VEC || FMMMUL_VEC || FMADD_VEC ||
                   FMSUB_VEC ... }; }
  CODING { 0b1 0b00 insn }
  SYNTAX { insn }
  BEHAVIOR { PC = PC + 1; OPC = OPC + 1;
    // insn(); /* replaces the following activation in this case */
  }
  ACTIVATION { insn }
}

Listing 6.3: Example Description of Coding Root
The LISA language has the ability to describe pipelined architectures.
A specific operation can be assigned to a certain pipeline stage using
the keyword IN together with a declaration of the stage names in the
RESOURCE section (see Listing 6.1). The operations in Listing 6.2 are
executed in the pipeline stage EX, whereas the operations in Listing 6.3
are executed in stage DE.

The assembler syntax for one instruction is described by the ensemble
of all operations that describe the behavior and coding of this
instruction. As an example, Listing 6.4 describes the syntax of the
instruction FADD: A legal instance of this instruction is e.g. FADD (FR1, FR7).



OPERATION insn_2freg IN pipe.EX
{
  DECLARE
  {
    GROUP fregs1, fregd1 = { freg };
    GROUP insn = { FADD || FSUB || FMUL };
  }
  CODING { 0b001 0bx[4] insn fregs1 fregd1 }
  SYNTAX { insn "(" fregs1 fregd1 ")" }
  ACTIVATION { insn }
}

OPERATION freg IN pipe.EX
{
  DECLARE { LABEL index; }
  CODING { index=0bx[3] }
  SYNTAX { "FR" ~index=#U }
  EXPRESSION { index }
}

OPERATION FADD IN pipe.EX
{
  DECLARE
  {
    REFERENCE fregs1, fregd1;
  }
  CODING { 0b000 0bx[4] }
  SYNTAX { "FADD" }
  BEHAVIOR { FR_r[fregd1] = FR_r[fregd1] + FR_r[fregs1];
             FR_i[fregd1] = FR_i[fregd1] + FR_i[fregs1];
  }
}

Listing 6.4: Example Syntax Description of FADD Instruction



Listing 6.4 is also an example for the usage of an EXPRESSION section,
which returns the numeric value of the 3 bit register field freg to the
operation insn_2freg in this case. These values are not directly needed
in operation insn_2freg, but rather referenced (keyword REFERENCE)
and used by the operation FADD in order to perform the actual complex
addition.

Listing 6.3 depicts an example for the ACTIVATION section, which is
used in order to trigger the execution of further operations that are
declared after the GROUP keyword. In this example, the ACTIVATION
section of the operation insn_2word can be replaced by an explicit call
to insn() in the BEHAVIOR section. The order of execution of the
ACTIVATION and BEHAVIOR sections depends on a LISA compiler switch.
For the current example, this switch is set to execute the BEHAVIOR
section first.



6.2 The LISA Design Environment 

Development tools for software and hardware are of paramount impor- 
tance for ASIP designs in order to efficiently profile the applications 
and architectures and to obtain error-free implementations. The appli- 
cation and architecture profiling methodology in Chapter 5 requires a 
retargetable compiler as well as a simulator with profiling capabilities. 
Furthermore, hardware generation using a high level architecture de- 
scription is beneficial to reduce the design time. 

The LISA ASIP design environment uses a single LISA description in 
order to generate the following software design tools: assembler, linker 
and simulator with API as well as debugger, debugger GUI, profiler 
and cosimulation interfaces. Figure 6.1 provides an overview of this 
design environment: The LISA processor compiler^ is the heart of this 
environment and generates the design tools automatically according to 
Figure 6.1. 

The generated tools support a large part of the software development 
process for ASIPs, currently with the exception of a HLL compiler, 
which is subject to ongoing research. 



^The name LISA Processor Compiler does not refer to a high level language compiler, which generates 
assembly code for a processor. The LISA processor compiler is rather responsible to generate the above- 
mentioned software design tools. 
Figure 6.1: The LISA Processor Design Environment 



A significant part of the overall design time according to Appendix F 
is needed for the hardware description and verification of an ASIP. Re- 
cently^, the hardware generation task has been fully automated by the 
LISA HDL generator, which has been developed at the Institute for In- 
tegrated Signal Processing Systems [218]. 

This HDL generator uses the pipeline description in the LISA RE- 
SOURCE section to automatically generate the ASIP pipeline registers 
as well as a coarse structure of the ASIP. Furthermore, the decoder is 
generated using the information of the CODING sections. Empty wrap- 
pers for the functional units can be automatically obtained, whereas the 
generation of the internal structure of these functional units is currently 
being developed. In order to achieve this additional functionality, the 



^When I wrote the first version of this thesis, this LISA task was still only partially automated. Currently, 
a large part of the enhancements for HDL generation described in the following sections are already fully 
functional in the LISA production tools. 
LISA keyword UNIT has been introduced in order to bind operations to
functional units.

The LISA processor design methodology, as available when this thesis
was started, had two considerable disadvantages:

• the designer had to provide full and consistent CODING informa- 
tion even during design exploration, which often required changes 
in the complete coding tree, especially if additional instructions 
were inserted 

• tedious verification of the complete ASIP description, consisting
of automatically generated and hand-written parts

Enhancements to this design methodology are presented in the following
section, which remove these disadvantages and make it possible to obtain
optimum results within a significantly reduced design time.



6.3 Extensions to the LISA Design Environment 

In this section, two extensions to the LISA design environment are de- 
scribed that have been developed in this thesis: An instruction encoding 
and decoder generator, as well as a semi-automatic test pattern genera- 
tor. Apart from the speed up in design time for both approaches, the in- 
struction encoding generator automatically produces instruction encod- 
ings with an optimum coding density to reduce the instruction memory 
size. Furthermore, significant energy savings for a given program will 
be demonstrated by using encoding optimization that takes into account 
the profiling information of the program. 



6.3.1 Instruction Encoding and Decoder Generation 

Initially^, the LISA language required the user to specify the detailed 
coding of each instruction right from the beginning of a design space 

^The current version of the LISA tools includes a fully functional automatic coding capability as sug- 
gested by this thesis. 
exploration. This had the consequence of tedious manual modifications 
in complete coding subtrees, if new instructions were inserted. 

Automatic coding generation is useful in order to speed up this process: 
The user only has to specify the instruction operands, whereas the in- 
struction opcode field can be omitted. The detailed instruction coding 
is generated automatically exploiting the information of used operands 
that is provided by the DECLARE section. After this process is fin- 
ished, the LISA compiler can generate the software programming tools 
as usual. Furthermore, the hardware description of the decoder can eas- 
ily be generated. For this thesis, an experimental EDA tool for this task 
has been implemented, which is referred to as ICON^ in the following 
discussion. 

The percentage of the instruction memory power of an ASIP can be sig- 
nificant according to the results of the case study in Appendix F. For 
applications with larger instruction memories (or instruction caches im- 
plemented by RAMs) this percentage is obviously even higher. The 
ASIP instruction coding directly affects the size and the energy con- 
sumption of the instruction memory. For this reason, an additional au- 
tomatic optimization step to instruction coding has been developed. 

The instruction coding affects the silicon area for the program memory 
as well as the energy consumption for the following reasons: 

• the power consumption and the area of the instruction memory are 
approximately proportional to the instruction width, provided that 
the access schemes and the toggle activity are constant 

• a large part of the energy consumed in the instruction memory 
depends on the toggle activity of the internal bit lines (refer to 
Figure 6.5), which represent large capacitances 

• in case of external instruction memory, the instruction bus is even 
more heavily loaded by pad and external capacitances leading to a 
considerable power contribution 

As a consequence, ICON has two different optimization tasks: 

• minimization of the instruction width 



^Instruction COding geNerator

• minimization of either the

  - internal memory toggle activity or the

  - toggle activity on the instruction bus

The following definitions of important terms w.r.t. instruction coding
are used in this discussion:

• the term instruction (or instruction instance) represents the idea 
of one specific ASIP instruction with an associated behavior e.g. 
MOVI #3, R5 (move the value 3 to register 5) 

• instruction operand refers to the operands of an instruction like 
e.g. #3 or R5 in the above example 

• the term instruction word or instruction code word refers to the
coded representation of one instruction e.g. “000111001010...”

• the term instruction type is the generic term for the set of possible
instructions with the same behavior and the same operand types
like e.g. moving an immediate value to a register (represented
by the mnemonic MOVI)

• the term operation code or opcode refers to the part of the instruc- 
tion code word, which determines the instruction type 

• the instruction format determines the position of the opcode and 
the operand fields within the instruction word for each instruction 
type 

For the sake of simplicity and due to the fact that many ASIPs use sim- 
ple fetch and decoding units, only constant length instruction words are 
covered in the following. 



6.3.1.1 Minimization of the Instruction Width

This task can be performed with or without a reduction in flexibility of 
the instruction set. The lower bound of the instruction word width is 
given by the width of the binary word which is needed to enumerate all 
the different instruction instances that are used in a given program. If 
we count $N_{used\_instructions}$ different instructions in a given
program, this results in the minimum width of

$$W_{instr,min} = \lceil \mathrm{ld}(N_{used\_instructions}) \rceil \qquad (6.1)$$



This minimum instruction word width has the demerit of massively re- 
ducing the flexibility and reusability of the instruction set, because pro- 
gram changes requiring new instructions with different operand values 
are impossible. Furthermore, this methodology necessitates a signifi- 
cant effort for decoding, which is prohibitive for logic synthesis in case 
of many different supported instructions. 
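
The lower bound of Equation (6.1) can be evaluated with a few lines of Python. This is only a sketch: the function name min_instruction_width and the short example program are assumptions made for illustration, and ld denotes the logarithm to base 2.

```python
def min_instruction_width(program):
    """Lower bound of Equation (6.1) for a list of instruction instances."""
    n_used = len(set(program))  # number of distinct instruction instances
    if n_used == 0:
        raise ValueError("the program must contain at least one instruction")
    # (n - 1).bit_length() equals ceil(ld(n)) for every integer n >= 1
    return (n_used - 1).bit_length()


# Hypothetical program: six instruction words, five distinct instances.
program = [
    "MOVI #3, R5",
    "MOV R5, R1",
    "ADD R1, R5, R2",
    "MOVI #3, R5",
    "CLR R1",
    "SLEEP",
]
width = min_instruction_width(program)
```

For this program five distinct instances have to be enumerated, so three bits suffice. Any program change that introduces a sixth, seventh or eighth instance still fits, but a ninth one would already change the instruction word width, which illustrates the loss of flexibility described above.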

On the other hand, a reasonable upper bound^ for the instruction word
width can be obtained by including all the different operand fields, each
of width $W_{operand,i}$, and the opcode field of width $W_{opcode}$ for
each instruction type into the instruction code word and performing the
maximum operation, which results in a width of

$$W_{instr,max} = \max_{\text{opcode types}} \Big( W_{opcode} + \sum_i W_{operand,i} \Big) = \max_{\text{opcode types}} \big( W_{opcode} + W_{operand\ fields} \big) \qquad (6.2)$$

The opcodes that are assigned to the $N_{instr\ types}$ different
instruction types can use e.g.

• one-hot encoding [46] which yields (the upper bound of the opcode
width of)

$$W_{opcode,one\_hot} = N_{instr\ types} \qquad (6.3)$$

• constant width encoding by simple enumeration resulting in

$$W_{opcode,enum} = \lceil \mathrm{ld}(N_{instr\ types}) \rceil \qquad (6.4)$$

• prefix encoding using code words of different widths (similar to
the concept of a Huffman-code [119], resulting in the lower bound
for the opcode width as demonstrated later on)
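
Equations (6.2) to (6.4) can be tried out with the following Python sketch. The function names are hypothetical, the opcode width is assumed to be the same for all instruction types, and the split of each instruction into individual field widths (for example a 16 bit immediate plus a 3 bit register field for MOVI) is an assumption chosen for illustration.

```python
def opcode_width_one_hot(n_instr_types):
    """Equation (6.3): one opcode bit per instruction type."""
    return n_instr_types


def opcode_width_enum(n_instr_types):
    """Equation (6.4): ceil(ld(N)) bits for a simple enumeration."""
    return (n_instr_types - 1).bit_length()


def instr_width_max(operand_fields, opcode_width):
    """Equation (6.2) for an opcode field of constant width."""
    return max(opcode_width + sum(widths) for widths in operand_fields.values())


# Assumed operand field widths per instruction type (illustrative values).
operand_fields = {
    "MOV": [3, 3],
    "MOVI": [16, 3],
    "CLR": [3],
    "SLEEP": [],
}
n_types = len(operand_fields)
width_enum = instr_width_max(operand_fields, opcode_width_enum(n_types))
width_one_hot = instr_width_max(operand_fields, opcode_width_one_hot(n_types))
```

In both cases the instruction type with the widest operand fields determines the result; only the opcode contribution differs between the two encodings.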



^More coding bits would represent a waste, but are obviously possible. 
The one-hot encoding approach is only useful in order to get an
extremely simple decoder in hardware, which is irrelevant for the
decoder complexities considered in the context of ASIPs with a
reasonable number of different instruction types (typ. < 100 instruction
types). For practical implementations either the prefix or the constant
width encoding approach is favorable in order to minimize the
instruction word width.

After the ASIP designer has determined the useful instruction
operands the tool ICON optimizes the instruction opcode assignments.
For the prefix code, ICON uses an algorithm similar to the
one for constructing a Huffman code (cf. Cormen [59], pp. 339-341).
Instead of the symbol probability of the Huffman approach, the total
width of the operand fields $W_{operand\ fields,j}$ of each instruction
$j$ is used to build the Huffman tree. Consequently, the annotation of
the new node has to be performed using the modified update function for
the so-called merge operation

$$w(z) = \max(w(x), w(y)) + 1 \qquad (6.5)$$

instead of the simple addition of probabilities in the original Huffman 
tree construction algorithm. This new update function reflects the fact, 
that the new subtree has the coding width of the maximum of the sub- 
trees incremented by one for an additional decision bit between the 
right and the left subtree. Listing 6.5 depicts the resulting algorithm, 
which constructs an optimum opcode assignment. The optimality fol- 
lows from the optimality of the greedy algorithm for the original Huff- 
man coding problem [59]. 

For instruction sets with differences between the sum of the used
operand field widths $W_{operand\ fields}$ for different instruction
types, this methodology typically yields a shorter overall coding width
$W_{instr}$ than the constant opcode width approach. Otherwise, the
width is equal to that of the constant opcode width encoding approach.

Example: Consider the instructions, which are depicted in Figure 6.2. They
have the operand field widths $W_{operand\ fields,i}$, $i = 0 \ldots 7$, of
(6, 19, 13, 3, 9, 0, 9, 9) (the empty fields represent unassigned bits in
the instruction word).



^Omissions of operand-fields for certain instructions are possible, but reduce the flexibility and
orthogonality of the instruction set. This critical decision should be left to the ASIP designer.
/* C is a set of n different instruction types with
   associated operand widths */
MOD_HUFFMAN(C)
  n = |C|
  Q = C                               // Q is a priority queue
  FOR i=1 TO n-1
    DO z = ALLOCATE_NODE              // -create new node instance
       x = left(z) = EXTRACT_MIN(Q)   // -extract the two min. nodes
       y = right(z) = EXTRACT_MIN(Q)  //  operand widths from Q
       w(z) = max(w(x), w(y)) + 1     // -new update function
       INSERT(Q, z)                   // -insert new subtree in Q
  RETURN EXTRACT_MIN(Q)               // -return root of Huffman tree

Listing 6.5: Algorithm for Optimum Opcode Coding Tree Construction



0: MOV Rs, Rd
1: MOVI #imm, Rd
2: R (ARn,$offset), Rd
3: CLR Rd
4: MUL Rs1, Rs2, Rd
5: SLEEP
6: ADD Rs1, Rs2, Rd
7: SUB Rs1, Rs2, Rd

[Figure: operand field layout of each instruction within the instruction
word (fields Rs, Rd, imm, offset, ARn, Rs1, Rs2)]

Figure 6.2: Operand Widths of Example Instruction Set



A constant width encoding approach requires 3 bits in the opcode field to
code the 8 different instruction types, which results in an overall instruction
width of 19+3=22 bits (19 bits are needed for the operands of instruction 1).
The proposed algorithm for prefix encoding requires only 20 bits according to
Figure 6.3. The final instruction coding for this prefix encoding approach is
depicted in Figure 6.4. Here, it is obvious that the instruction 1 determines the
overall instruction width due to the long immediate field.
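
The construction of Listing 6.5 can be retraced for this example with a short Python sketch. It is not the ICON implementation: the function name mod_huffman and the use of Python's heapq module as priority queue are choices made here for illustration. Instruction types are identified by their index 0 to 7 as in Figure 6.2.

```python
import heapq
import itertools


def mod_huffman(operand_widths):
    """Build the opcode tree with the merge rule w(z) = max(w(x), w(y)) + 1.

    operand_widths maps an instruction type to the total width of its
    operand fields. Returns (instruction_width, opcode_lengths).
    """
    tie = itertools.count()  # tie-breaker, so the heap never compares leaf lists
    heap = [(width, next(tie), [name]) for name, width in operand_widths.items()]
    heapq.heapify(heap)
    opcode_lengths = {name: 0 for name in operand_widths}
    while len(heap) > 1:
        wx, _, leaves_x = heapq.heappop(heap)  # the two subtrees of minimum width
        wy, _, leaves_y = heapq.heappop(heap)
        leaves = leaves_x + leaves_y
        for name in leaves:
            opcode_lengths[name] += 1          # one more decision bit above each leaf
        heapq.heappush(heap, (max(wx, wy) + 1, next(tie), leaves))
    return heap[0][0], opcode_lengths


# Operand field widths of instructions 0..7 from the example above.
widths = dict(enumerate([6, 19, 13, 3, 9, 0, 9, 9]))
instr_width, opcode_lengths = mod_huffman(widths)
# Constant width encoding: ceil(ld(8)) opcode bits plus the widest operand part.
constant_width = (len(widths) - 1).bit_length() + max(widths.values())
```

For the operand widths of the example this sketch reproduces the 20 bit instruction word of the prefix encoding and the 22 bit word of the constant width encoding. Instruction 1 receives a one bit opcode, which matches the observation that its immediate field determines the overall width.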

For the real-world instruction set in Appendix C this coding assignment 
optimization also yields a 10% reduction in coding width. It has to be 
emphasized, that this optimization does not impair the flexibility and 
reusability of the instruction set. 
Figure 6.3: Coding Tree for Example



A: MOV Rs, Rd
B: MOVI #imm, Rd
C: R (ARn,$offset), Rd
D: CLR Rd
E: MUL Rs1, Rs2, Rd
F: SLEEP
G: ADD Rs1, Rs2, Rd
H: SUB Rs1, Rs2, Rd




Figure 6.4: Final Instruction Coding for Example 



6.3.1.2 Minimization of the Toggle Activity 

The first toggle activity optimization that is described here is the
optimization of the toggle activity for on-chip instruction memories. In
Figure 6.5 the internal structure of a read-only memory is depicted.

^The depicted ROM uses NMOS-bit cells. RAMs use a comparable structure with different memory
cells.
Figure 6.5: Internal Structure of a ROM
Typical NMOS-ROMs use a two-phase access scheme, which starts by
precharging the bit lines to logic 1 in the first phase. The access to a
row of bit cells is performed in the second phase by the row decoder,
which asserts one word line. If a specific bit cell contains a logic 0,
the associated bit line is discharged. Otherwise, no discharging activity
occurs and the bit line remains in the charged state. Figure 6.6 shows
a model of the ROM with the relevant internal capacitances using ideal
switches for the bit cells. This ROM model has been used for the power
evaluations in this thesis. According to the case study in [47], 70% of
the total energy consumption of an SRAM is required for the bit lines,
the associated sense amplifiers and the bit cells themselves. In the
following, we assume 30% to 60% for the percentage of power consumed
by the bit line toggle activity of a ROM.
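
The consequence of this access scheme for the instruction coding can be sketched in Python: in the precharge model, every stored 0 bit that is read discharges one bit line, so the number of 0 bits is a simple proxy for the bit line energy. The function names are hypothetical, and capacitance values as well as the other energy contributions are deliberately left out.

```python
def bitline_discharges(word, width):
    """Bit lines discharged by reading one word: one per stored 0 bit."""
    ones = bin(word & ((1 << width) - 1)).count("1")
    return width - ones


def total_discharges(words, width):
    """Discharge events for a sequence of words read from the memory."""
    return sum(bitline_discharges(word, width) for word in words)
```

A word consisting only of 1 bits causes no discharge at all in this model, whereas an all-zero word discharges every bit line of the word.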




Figure 6.6: ROM Model with Capacitances



^12 This figure strongly depends on the organization of the memory: shorter bit lines can be traded off for
longer word lines. Furthermore, divided bit and word lines or multi-block partitioning can be used [257].

The proposed algorithm to find an optimum instruction encoding takes
advantage of the fact that, from an energy perspective, the 0 bits in
the memory are more expensive than the 1 bits. A straightforward
conclusion is to assign 1s to each don’t care bit in each instruction.
Further degrees of freedom for this code assignment are:

• use of non-inverted or inverted operand fields as a function of the
specific instruction type

• use of a redundant operand representation with one additional
invert bit, which indicates that the stored value has been bit-inverted

• assignment of maximum weight first opcode codings in the case of
constant width encoding, or swapping of the binary decision bits
in the Huffman tree (ref. to Figure 6.3) in the case of prefix coding

The strategy to take one of the above-mentioned decisions is based on
instruction trace files^13, which yield the frequency of each
instruction and histograms for the used operand fields. Typical ROM
implementations store several memory words in one memory row according
to Figure 6.5. Each access to one row results in toggle activity caused
by all the stored words in this row. This fact requires redefining the
instruction frequency as a basis for the above-mentioned optimization:
an access to one specific memory row results in an access to several
instructions in this model; thus, the access count of all the
instructions residing in this row has to be incremented. In the case of
divided word line or split bit line memory implementations [176], this
fact can readily be taken into account by redefining the
above-mentioned instruction frequency.
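The row-based redefinition of the instruction frequency can be sketched as follows. This is only a minimal illustration and not the ICON implementation: the function name `row_access_counts`, the parameter `words_per_row` and the assumption that consecutive program addresses share one memory row are hypothetical.

```python
from collections import Counter

def row_access_counts(trace_addresses, words_per_row):
    """Count, per program address, how often its ROM row is activated.

    Assumption (hypothetical layout): all addresses a with the same
    value of a // words_per_row are stored in the same memory row.
    """
    row_hits = Counter(addr // words_per_row for addr in trace_addresses)
    counts = Counter()
    for row, hits in row_hits.items():
        # Every word residing in an accessed row contributes toggle activity.
        for addr in range(row * words_per_row, (row + 1) * words_per_row):
            counts[addr] += hits
    return counts

# Addresses 0, 1, 1 hit row 0 three times; address 5 hits row 1 once.
counts = row_access_counts([0, 1, 1, 5], words_per_row=4)
```

With this weighting, an instruction that is never executed itself still receives a non-zero access count if it shares a row with frequently executed instructions.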

The decision whether it is worth storing a certain operand field (e.g.
the immediate field of a MOVI instruction) using bit inversion is based
on the frequency of 1 bits and 0 bits in this field. If the frequency of
0 bits is higher than the frequency of 1 bits, this immediate field is
inverted^14.

Using an additional invert bit for a certain operand field potentially
increases the total instruction coding width. If the sum of the operand
field widths W_operand_fields of one instruction does not fully exploit
the necessary instruction word length W_instr, this option is an
alternative to the above-mentioned static bit-inverted representation.
Otherwise, this decision is left to the designer, who has to trade off
toggle activity against instruction word width.

^13 These instruction trace files need to be generated with the instruction set simulator.

^14 The value which is actually stored in the memory is the bit-inverted value of the original value. The
original value has to be restored by bit inversion in the ASIP decoder, based on the specific decoded
instruction or based on the additional invert bit.
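The static inversion decision for one operand field can be sketched as follows. The helper names `should_invert` and `stored_value` as well as the example trace are assumptions made for illustration; they are not taken from ICON.

```python
def should_invert(field_values, width):
    """Return True if the field holds more 0 bits than 1 bits over the trace."""
    mask = (1 << width) - 1
    ones = sum(bin(v & mask).count("1") for v in field_values)
    zeros = width * len(field_values) - ones
    return zeros > ones

def stored_value(value, width, invert):
    """Value written to the ROM; the decoder undoes the inversion again."""
    mask = (1 << width) - 1
    return (~value & mask) if invert else (value & mask)

imm_trace = [0x01, 0x02, 0x00, 0x10]   # hypothetical 8-bit immediates, mostly 0 bits
invert = should_invert(imm_trace, 8)
rom_word = stored_value(0x01, 8, invert)
```

Applying `stored_value` twice with `invert=True` returns the original value, which mirrors the restoring bit inversion in the decoder.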

The opcode assignment sorts the set of all possible opcode instances by
maximum weight (number of 1 bits) and assigns them to the instruction
types, which are sorted according to their frequency in the instruction
traces. A similar consideration applies to the swap operation of binary
bits within the Huffman tree: the decision bits in the tree are swapped
if the 1 bit is in the branch with the lower frequency. The result of
this optimization process is optimum for the given degrees of freedom,
because of the optimality of the individual assignments, which
individually maximize the number of 1 bits for each field.
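For a constant width opcode field, the maximum weight first assignment can be sketched as follows. The function name and the example frequencies are hypothetical; ties are broken by code value and type name only to make the sketch deterministic.

```python
def assign_opcodes_max_weight(freq, width):
    """Give the most frequent instruction types the codes with the most 1 bits."""
    codes = sorted(range(1 << width),
                   key=lambda c: (-bin(c).count("1"), c))
    types = sorted(freq, key=lambda t: (-freq[t], t))
    if len(types) > len(codes):
        raise ValueError("opcode field too narrow")
    return {t: format(c, "0{}b".format(width)) for t, c in zip(types, codes)}

# Hypothetical instruction frequencies taken from a trace.
opcodes = assign_opcodes_max_weight(
    {"ADD": 50, "MOVI": 30, "BEQ": 15, "HALT": 5}, width=2)
```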

Example: For the real-world ASIP of Section 7.1, the tool ICON has been
applied in order to perform the above-mentioned coding optimization. Table 6.1
depicts the results of this optimization. The unoptimized binary coding is an
ad hoc coding using prefix opcodes and logic 0s in the don’t-care positions.
Using ICON, the internal bit line toggle activity is reduced by about 70%.
Depending on the percentage of power which is consumed by the toggle activity
of the bit lines^15, overall power savings of 10% to 20% are achieved for the
case study.



encoding technique    bit line toggles    savings (BL toggle count)
unopt. binary         2.93M               -
optimized encoding    0.829M              71.7%

Table 6.1: Results of Internal Memory Toggle Rate Optimization

In case of off-chip instruction memory, the optimization of the
instruction coding is even more important, because of the impact on the
toggle activity of the chip pad and external capacitances. The
optimization problem in this case is different, because instead of
maximizing the static number of 1 bits, the number of 0 → 1 and 1 → 0
transitions has to be minimized. The optimization problem can be
expressed as

    \arg\min \sum_{j=0}^{W_{instr}-1} \sum_{i=1}^{N_{trace\_length}-1} IT_{i,j} \oplus IT_{i-1,j}    (6.6)

where IT_{i,j} represents the j-th bit of the instruction i in the
instruction trace of length N_{trace\_length}. The exclusive-or relation
can also be rewritten as

    \arg\min \sum_{j=0}^{W_{instr}-1} \sum_{i=1}^{N_{trace\_length}-1} \ldots    (6.7)

The instruction coding for Equation 6.7 obviously has to meet
additional constraints like unambiguous opcode and operand field
assignments, which have to be respected during the optimization
process.

^15 Unfortunately, this percentage is unknown for the implemented memory of the case study in
Section 7.1. In this case, we assume a percentage between 30% and 60% as mentioned above.
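The objective of Equation 6.6 is simply the number of bit flips between consecutive instruction words on the bus. A minimal sketch, with a hypothetical function name and a made-up three-word trace, is given below. The last line checks one possible algebraic rewriting of the exclusive-or for single bits; it is not necessarily the form used in Equation 6.7.

```python
def bus_toggles(trace, width):
    """Count bit flips between consecutive instruction words of a trace."""
    mask = (1 << width) - 1
    return sum(bin((cur ^ prev) & mask).count("1")
               for prev, cur in zip(trace, trace[1:]))

trace = [0b1010, 0b1000, 0b0111]   # hypothetical 4-bit instruction words
total = bus_toggles(trace, 4)      # 1 flip, then 4 flips

# For single bits a and b, the exclusive-or equals (a - b)**2.
xor_identity_holds = all((a ^ b) == (a - b) ** 2 for a in (0, 1) for b in (0, 1))
```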

The degrees of freedom for this optimization are

• assignment of don’t care bits

• use of non-inverted or inverted operand fields or redundant
representation with invert bit

• assignment of opcode codings

• position of operand fields for each instruction type^16

This optimization is difficult to handle, because of the huge problem
complexity and the unavoidable overlap of different instruction fields.
For this reason, the tool ICON uses a heuristic optimization technique,
which is motivated in the following. Consider the case of a constant
width opcode field, which occupies a certain bit range in the
instruction. Without loss of generality, we assume that this range
occupies the bit positions 0 to W_opcode,enum − 1. Thus the task of
optimizing the opcode assignment for this case has to minimize the
expression

    \sum_{j=0}^{W_{opcode,enum}-1} \sum_{i=1}^{N_{trace\_length}-1} IT_{i,j} \oplus IT_{i-1,j}    (6.8)

which is only a function of the actual opcode field assignment. The
transition matrix for these opcode fields T_opcode(i, j) is defined as
the number of transitions between instruction type i and instruction
type j, where i, j ∈ {0, ..., N_instr_types − 1}. This matrix can easily
be computed using the relevant information of the instruction trace.
The heuristic of opcode assignment starts by finding a maximum value^17
in the off-diagonal elements of matrix T_opcode, which yields two
instruction types i and j with the highest transition frequency. Two
codings of width W_opcode,enum with a Hamming distance of one are
assigned to these instruction types i and j. The maximum value in
matrix T_opcode is marked assigned and the heuristic continues by
finding the next unassigned maximum value in the off-diagonal fields
and assigns a coding with minimum Hamming distance^18. If several
assignments are possible, the tool selects the coding that minimizes
the incremental toggle count considering a parameterizable number of
other already assigned instruction types.

^16 The position of the opcode field itself has to start at a fixed location in order to enable the decode
operation. This is equally true for split opcode fields, which are not explicitly covered in this discussion.
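The greedy opcode assignment can be sketched as follows. This is a simplified illustration under stated assumptions, not the ICON algorithm: ties are resolved deterministically instead of randomly, only the already assigned partner of a pair is considered when choosing a code, and all function names are hypothetical.

```python
from collections import Counter

def transition_counts(type_trace):
    """Symmetric count of transitions between different instruction types."""
    t = Counter()
    for a, b in zip(type_trace, type_trace[1:]):
        if a != b:                        # only off-diagonal elements matter
            t[frozenset((a, b))] += 1
    return t

def hamming(a, b):
    return bin(a ^ b).count("1")

def greedy_opcodes(type_trace, width):
    """Assign close codes to instruction types that often follow each other."""
    if len(set(type_trace)) > (1 << width):
        raise ValueError("opcode field too narrow")
    t = transition_counts(type_trace)
    free = set(range(1 << width))
    code = {}
    # Visit type pairs by decreasing transition count (ties broken by name).
    for pair, _ in sorted(t.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
        for typ in sorted(pair):
            if typ in code:
                continue
            partner = next(iter(pair - {typ}))
            if partner in code:
                # Free code closest to the already assigned partner.
                c = min(free, key=lambda x: (hamming(x, code[partner]), x))
            else:
                c = min(free)
            code[typ] = c
            free.remove(c)
    for typ in sorted(set(type_trace) - set(code)):   # types without transitions
        code[typ] = min(free)
        free.remove(code[typ])
    return code

code = greedy_opcodes(["A", "B", "A", "B", "C", "A", "D"], width=2)
```

In this example the pair A/B has the highest transition count and therefore receives two codes with a Hamming distance of one.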

This greedy algorithm continues until all instructions are assigned.
The degrees of freedom in this algorithm are chosen randomly in order
to prune the complexity of the algorithm^19. This approach allows the
algorithm to be rerun several times in order to find better solutions.

Similar heuristics have been used to optimize the remaining assignment
problems, which are more complicated, due to the fact that the
different operand fields and the opcode field with prefix coding have
overlapping bit ranges. Furthermore, the position of operand fields for
each instruction type introduces an additional degree of freedom^20.
Heuristics for the opcodes and the operand assignments are used as a
basis for a genetic optimization algorithm. The reproduction function
of this algorithm uses code exchanges both within one code assignment
and between several instances of code assignments. Table 6.2 shows the
optimization results for a real-world case study using the
above-mentioned toggle optimization. If we assume a load capacitance
between 1 pF and 10 pF of the external instruction bus, this
optimization saves between 1.8 and 9 mW in total system power for a
100 MHz system clock. Compared to the power consumption of the ASIP in
Appendix F, which is in the order of 20 mW, this saving is significant.

^17 Generally, there is no single maximum value in matrix T_opcode.

^18 If one instruction type which is associated to this maximum is already assigned, the heuristic assigns a
coding to the other instruction types with minimum Hamming distance.

^19 Exhaustive optimization for a relevant problem size is impossible due to the algorithmic complexity.
Optimization approaches, which take into account several degrees of freedom (and use significantly more
runtime), have not been able to yield significantly better results.

^20 Commercial processors often use fixed positions for operand fields which occur in several instruction
types. This methodology saves multiplexers in the decoder. On the other hand, the area and power
contribution of this part of the decoder in our case studies in Appendix F clearly shows that this additional
complexity for the considered word widths is negligible.



encoding technique    absolute toggles / 1E3    savings
adhoc assignment      521                       0%
optimized             241                       53.7%

Table 6.2: Results of Instruction Bus Toggle Rate Optimization

Once the code assignments of all the instructions are finished, the
task of generating the actual hardware decoder is straightforward.
Currently, ICON uses VHDL as target description language. Listing 6.6
depicts extracts of the ICON-generated decoder description for a MIPS
instruction set without coding optimizations. This decoder generation
takes advantage of the capabilities of VHDL to use structured data
types like records and enumeration types. This eases the development
and the debugging of the code, because the instruction mnemonics
instead of the instruction binary codes are shown in the waveform
viewer during simulation.



6.3.2 Semi-Automatic Test Case Generation 

In Section 5.4 the importance of hardware verification has been
motivated. The proposed verification task can be facilitated using a
test case generation (TCG) tool to automate the generation of test
programs and test stimuli. This TCG tool has been conceived in order to
provide stimuli for the cosimulation between a golden reference (which
is typically a high level instruction set processor^21) and a given
hardware description. This simulation-based approach is comparable with
the methodology typically used for commercial processor designs (cf.
e.g. [68] [118] [149] [168] [174]) in order to cover typical fault
models of the implementation.

^21 In [87] the equivalence between a VHDL implementation and a physical instance of the commercially
available TMS320C25 processor has been verified.

entity score_predecoder is
  port ( clk, rstq               : in  std_logic;
         insert_nop, insert_idle : in  std_logic;
         predecode_input         : in  std_logic_vector (22 downto 0);
         predecode_output        : out compl_operation_t );
end score_predecoder;

architecture predecode of score_predecoder is
  signal predecoder_register : std_logic_vector (22 downto 0);
begin
  process (predecode_input)
  begin
    -- opcodes
    case predecode_input (22 downto 19) is
      when "0000" =>
        predecode_output.opcode <= addi_op;
      when "0001" =>
        predecode_output.opcode <= andi_op;
      -- ...
      when others =>
        case predecode_input (22 downto 15) is
          when "11100000" =>
            predecode_output.opcode <= add_op;
          when "11110000" =>
            predecode_output.opcode <= sub_op;
          -- ...

    -- operands
    case predecode_input (22 downto 19) is
      when "0000" | "0001" | "0010" | "0011" | "0100" | "0101" |
           "0110" | "0111" | "1000" | "1001" | "1010" | "1011" |
           "1100" | "1101" =>
        predecode_output.reg0  <= conv_unsigned (0, 5);
        predecode_output.reg1  <= unsigned (predecode_input (18 downto 14));
        predecode_output.reg2  <= unsigned (predecode_input (13 downto 9));
        predecode_output.imm   <= signed (predecode_input (8 downto 1));
        predecode_output.addr  <= conv_unsigned (0, 8);
        predecode_output.raddr <= conv_signed (0, 8);
      when others =>
        case predecode_input (22 downto 15) is
          when "11100000" | "11100001" | "11100010" | "11100011" |
               "11100100" | "11100101" | "11100110" | "11100111" |
          -- ...
        end case;
    end case;
  end process;
end predecode;

Listing 6.6: Extract of a generated VHDL Decoder



If the abstraction level of operator-based RTL hardware design is used,
the fault model for this verification task does not have to cover the
operator implementations themselves^22. The fault model for this
verification task rather focuses on the correct interconnections and
the scheduling of the instantiated high level operators as well as on
custom designed operators and on finite state machines.

^22 This statement does not cover custom designed operator implementations but is rather valid if operator
implementations are taken from a synthetic operator library like the DesignWare library of Synopsys [233].

Custom designed operators should be verified in a first verification
step by e.g. exhaustive simulation or formal techniques. Exhaustive or
nearly exhaustive simulation of these operators is very often possible,
due to the limited number of inputs for these operators. This
verification task corresponds to the use of a divide-and-conquer
verification methodology and helps to reduce the total number of
simulation vectors needed to obtain a certain coverage of the
implementation.
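Exhaustive simulation of a small custom operator against a golden model can be sketched as follows. Both functions are hypothetical stand-ins: in a real flow, `sat_add_dut` would be replaced by the simulated hardware description of the operator, and the saturating adder is only an assumed example operator.

```python
from itertools import product

def sat_add_reference(a, b, width):
    """Golden model: saturating signed addition."""
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    return max(lo, min(hi, a + b))

def sat_add_dut(a, b, width):
    """Stand-in for the operator under test (differently structured model)."""
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    s = a + b
    if s > hi:
        return hi
    if s < lo:
        return lo
    return s

def exhaustive_check(width):
    """Apply every input combination and collect all mismatches."""
    rng = range(-(1 << (width - 1)), 1 << (width - 1))
    mismatches = [(a, b) for a, b in product(rng, repeat=2)
                  if sat_add_dut(a, b, width) != sat_add_reference(a, b, width)]
    return len(rng) ** 2, mismatches

vectors, mismatches = exhaustive_check(6)   # 64 * 64 input combinations
```

The number of vectors grows exponentially with the total input width, which is why this approach is limited to operators with few input bits.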

The proposed TCG tool needs the user to evaluate the coverage of a
given test program^23 suite and to define the basic structures of new
test programs that enable a higher simulation coverage. Theoretically,
this step can also be automated using an approach similar to [260] or
[108]. Unfortunately, these approaches require a formal description of
the hardware behavior and are restricted to simple architectures.

According to the results of this thesis that have already been published
in [87], test case generation using pseudo-random test vectors typically
achieves a remarkable percentage of state and execution coverage. This
statement agrees with the results in [260]. The uncovered part of the
design typically represents less than 5% to 10% and has to be covered
using manual interaction.

The proposed TCG tool of this thesis is able to generate pseudo-random
test programs and test I/O stimuli. Furthermore, the user can control
this test vector generation with so-called rules to obtain a certain
structure in the test program. By using these rules the user can
describe e.g. program loops with a defined exit condition or similar
conditional constructs. An example for these rules is given in
Listing 6.7. These rules use a C-like description style in order to
call a predefined ASIP-specific function (gen_instr) to generate an
instruction. Instructions (e.g. RPTK_INSN for the C25 repeat
instruction RPTK) and instruction groups (e.g. ANY_REPEATABLE_INSN) can
be selected as argument for the function gen_instr. Furthermore, the
arguments of these instructions can be constrained using e.g. the
random function RND. The next step is to speculatively simulate the
generated instruction(s) and insert them, after successful simulation,
in the test program (using the function simulate_and_commit). Other
functions are able to generate labels (gen_label) and to generate
random values that can be retrieved later on (clear_arr, ST_RND and
RE_VAL).

^23 Note that the term test program refers to the program which is actually used to verify the hardware
description and not for hardware testing.



/* rule to use a C25 repeat immediate instr. */
RPTK_RULE:
  gen_instr(RPTK_INSN, RND(3,20));         // generate RPTK instr.
  gen_instr(ANY_REPEATABLE_INSN);          // generate repeated instr.
  simulate_and_commit(30);                 // speculative execution of max.
                                           // 30 cycles and commitment in
                                           // case of successful simulation

/* rule to generate a loop with constant iteration count (ICORE) */
LOOP_RULE:
  gen_instr(LPINI_INSN, RND(2,10), PC+6);  // random repeat count
  for(int i=0; i<=5; i++)                  // generate 5 random arith.
    gen_instr(ANY_ARITH_INSN);             // instructions with rand. operands
  simulate_and_commit(60);                 // speculative execution of max.
                                           // 60 cycles and commitment in
                                           // case of successful simulation

/* rule to check, if BEQ works (ICORE) */
BEQ_RULE:
  clear_arr();
  gen_instr(MOVI_INSN, ST_RND(-128,127), "R"ST_RND(0,7));
                                           // generate MOVI instr. and
                                           // remember the operand values
  gen_instr(CMPI_INSN, RE_VAL(0), "R"RE_VAL(1));
                                           // generate CMPI instr. using the
                                           // remembered operand values
  gen_instr(BEQ_INSN, "BEQ_OK_LB");        // generate BEQ instruction
  gen_instr(HALT_INSN);                    // halt on error
  gen_label("BEQ_OK_LB");                  // generate unique label
                                           // BEQ_LB_#xyz#
  simulate_and_commit(10);                 // speculative execution of max.
                                           // 10 cycles and commitment in
                                           // case of successful simulation

Listing 6.7: Example Rules for Test Generator

According to these rules, the TCG tool can generate specialized
instruction sequences that are necessary to produce valid DSP assembler
code^24 even for non-orthogonal architectures. Furthermore, the test
case generator has to meet constraints of the target architecture in
order to generate valid code. Examples for these constraints are
register file and stack sizes, valid memory spaces or restricted
parameter ranges for certain arithmetic instructions. The degrees of
freedom for the instruction generation that are not restricted by the
generation rules and the constraints are determined using a
pseudo-random generator. So-called biases using non-uniform probability
functions increase the portion of the generated code that stresses
error-prone corner cases. Examples for these corner cases are compare
results equal/unequal to zero and arithmetic overflows.

^24 For the TMS320C25 specialized instruction sequences are needed to verify the RPT-instruction and
some arithmetic instructions like SUBC.
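A bias towards corner cases can be sketched with a simple non-uniform operand generator. The function name `biased_operand`, the chosen corner values and the default probability of 0.3 are assumptions made for illustration and do not describe the actual bias functions of the TCG tool.

```python
import random

def biased_operand(rng, lo, hi, corner_prob=0.3):
    """Draw an operand, favouring corner cases with probability corner_prob."""
    corners = [c for c in (lo, -1, 0, 1, hi) if lo <= c <= hi]
    if rng.random() < corner_prob:
        return rng.choice(corners)      # range limits and values around zero
    return rng.randint(lo, hi)          # otherwise uniform over the valid range

rng = random.Random(1)                  # fixed seed for reproducible programs
samples = [biased_operand(rng, -128, 127) for _ in range(1000)]
corner_hits = sum(s in (-128, -1, 0, 1, 127) for s in samples)
```

With a uniform distribution only about 2% of the 8-bit operands would hit one of these five values, so the bias raises the share of overflow and zero-compare situations considerably.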




Figure 6.7: Structure of the Automatic Test Case Generator



Figure 6.7 depicts the structure of this test case generator. Due to
the fact that the rules can leave many degrees of freedom to the
pseudo-random code generator, it is not guaranteed that the initially
generated code meets the constraints of the target architecture. In
order to solve this problem, the generated instructions and instruction
sequences are speculatively simulated with an instruction set simulator
to check the validity of these instructions (e.g. to avoid memory
access violations, incorrect operand values etc.). If a certain
instruction/instruction sequence is valid, it is appended to the final
test program and the next instruction/instruction sequence is
generated. This concept enables the test program generator to assess
the state coverage of the generated test program and immediately
provide feedback to the user, who can modify the rules and biases in
order to enhance the coverage. The test generator is also able to
generate stimuli for the input channels of the processor as well as
external events like IRQs, suspend and resume signals. After a suite of
test programs and stimuli has been generated, the HDL description of
the processor has to be cosimulated with the golden reference. In order
to debug the HDL description either a cosimulation interface or
generated trace files can be used, which trace the results of program
tasks or the complete program flow with all relevant states.
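The generate, speculatively simulate and commit loop can be sketched with a toy instruction set simulator. Everything in this sketch is hypothetical: the two-instruction target, the memory size and the function names only illustrate the principle and are unrelated to the real TCG tool or to any LISA-generated simulator.

```python
import copy
import random

MEM_SIZE = 16   # hypothetical data memory size of the toy target

def simulate(state, instr):
    """Toy instruction set simulator; raises on invalid instructions."""
    op, addr = instr
    if not 0 <= addr < MEM_SIZE:
        raise ValueError("memory access violation")
    if op == "STORE":
        state["mem"][addr] = state["acc"]
    elif op == "LOAD":
        state["acc"] = state["mem"][addr]
    else:
        raise ValueError("unknown opcode")

def generate_program(length, seed=0):
    """Append only those random instructions that simulate successfully."""
    rng = random.Random(seed)
    state = {"acc": 0, "mem": [0] * MEM_SIZE}
    program, rejected = [], 0
    while len(program) < length:
        candidate = (rng.choice(["LOAD", "STORE"]),
                     rng.randint(-4, MEM_SIZE + 3))
        trial = copy.deepcopy(state)      # speculative execution on a copy
        try:
            simulate(trial, candidate)
        except ValueError:
            rejected += 1                 # discard, committed state untouched
            continue
        state = trial                     # commit the simulated state
        program.append(candidate)
    return program, rejected

program, rejected = generate_program(20)
```

Because the committed simulator state is always available, coverage metrics can be evaluated while the program is being generated.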



6.4 Concluding Remarks 

This chapter has briefly introduced the concept of the LISA language
and the important software and hardware design development tools that
can be generated using this description. ASIP-specific extensions to
this tool suite have been introduced covering the optimization of the
instruction encoding in order to save energy and the generation of the
hardware decoder. With this methodology significant savings in energy
and design time have been achieved for the real-world ASIP case study
in Section 7.1.

Furthermore, a test case generator (TCG) to support and facilitate the
generation of ASIP test programs and test stimuli has been presented.
This tool partially automates the tedious process of generating test
programs and helps to reduce the design time while enabling a higher
simulation coverage. This TCG provides a transparent C-like script
language as user interface in order to generate meaningful programs to
stimulate error-prone parts of the design.

In order to transform the LISA ASIP design environment into a design
platform for ASIPs a comprehensive library of processor templates and
example processor designs for various application domains has to be
implemented. These library elements should include LISA descriptions
and optimized HLL compilers as well as fully verified hardware
descriptions. The availability of these templates and examples will be
a key factor to speed up iterative instruction set optimization for
many applications.




Chapter 7 



Case Studies 



In this chapter the results of two case studies are presented to prove
the feasibility of the proposed design flow. The first case study is
about an acquisition and tracking control processor for terrestrial
digital video broadcasting (DVB-T) and demonstrates the impact of ASIP
optimizations on energy-efficiency. The second case study covers an
ASIP for linear algebra kernels with a special focus on eigenvalue
decomposition of Hermitian matrices. This case study also compares the
design and implementation efficiency of an optimized ASIP
implementation with a general purpose processor core.



7.1 Case Study I: DVB-T Acquisition and Tracking 

This case study is about the design of an ASIP that controls the
acquisition and tracking process in a DVB-T receiver. Figure 7.1
depicts the simplified structure of this receiver. The DVB-T standard
[75] uses coded orthogonal frequency division multiplex (COFDM) as a
transmission technique. The term coded in this context means that the
transmitted data are protected against transmission errors by using
convolutional and block codes. Orthogonal frequency division multiplex,
on the other hand, means that the transmission channel in the frequency
domain is subdivided into equidistant subchannels. This principle can
be viewed as a modulation of many equidistant carriers with a
corresponding data sequence. The realization of this transmission
technique uses the inverse discrete Fourier transformation (IDFT,
implemented by the inverse fast Fourier transformation) for modulation
and the discrete Fourier transformation (DFT) for demodulation. In
order to avoid inter-symbol interference (ISI) of two consecutive OFDM
symbols, a guard interval is inserted that has to be longer than the
duration of the channel impulse response. Frame synchronization in the
receiver with respect to the position of the guard interval is needed
to reduce the effective inter-symbol interference. Frequency
synchronization before the DFT in the receiver mitigates the effect of
a sampling frequency offset between the transmitter and receiver. The
signal after the DFT in the receiver corresponds to the original signal
before the IDFT multiplied with the DFT of the channel, provided that
the frame synchronization and the frequency synchronization in the
receiver are perfect. For DVB-T, phase correction after the DFT is
performed with pilot carriers, because each subchannel uses a QAM
modulated carrier which is sensitive to phase errors.

The underlying application of this section computes the acquisition of 
the FFT window position (timing synchronization), the sampling clock 
synchronization (interpolation/decimation control) and the carrier fre- 
quency offset estimation (frequency synchronization) [89]. Further- 
more, after the acquisition is finished, the common phase error, the 
frequency error and the sampling error are continuously tracked. This 
lock condition is permanently monitored and automatic reacquisition is 
performed if needed. 

The implementation for this DVB-T acquisition and tracking (DVB-T
A&T) application has been named ICORE^1. The final ICORE
implementation including the hard- and software is integrated as one
design module into a commercial single chip. This system-on-a-chip
solution for DVB-T supports enhanced algorithms and features compared
to the previous receiver generation [123].




Figure 7.1: Digital Part of the DVB-T Receiver 



^1 ICORE is the abbreviation for ISS-Core.



7.1.1 Application Profiling and ASIP Class Selection

The results of application profiling are briefly summarized in this
subsection. For a more detailed description of this design task refer
to Subsection 5.2.2, where the DVB-T A&T application is used as a
vehicle to illustrate the profiling methodology.

Application profiling reveals the following results:

• cycle count violations of the profiling software implementation

• high data locality

• many data dependent branches (which are difficult to predict)

• many bit-oriented operations and arithmetic saturation operations

• insignificant part of regular DSP operations like e.g. FIR filters

• long idle intervals

• the time critical tasks frequently need arctan computations

These application profiling results are used in order to determine a
suitable ASIP class as a starting point for further optimization. The
high data locality of the application suggests a typical load/store
architecture with a general purpose register file. A short processor
pipeline is advantageous, in order to decrease the penalty of stall
cycles due to the high number of branches. Generally, simplicity of the
hardware is preferred over complexity to reduce the design and
verification effort. This approach also enhances the maintainability
and reusability of this design, which are two of the primary design
goals together with a high energy-efficiency.

In order to find the optimum pipeline organization, several
implementations with a different pipeline length and organization have
been implemented for the DVB-T A&T application^2. A detailed overview
of the pipeline organization for the different alternatives is given in
Appendix D. Table 7.1 depicts the results of this design space
exploration for the DVB-T A&T application benchmark. The silicon area
displayed in Table 7.1 increases and the critical path decreases with
an increasing number of pipeline stages, as expected. For the overall
runtime, the number of cycles has been multiplied with the critical
path of the implementation. Due to data and control hazards reducing
the average resource utilization in the ASIP pipeline, the change in
absolute runtime is less than proportional to the change in the
critical path for an increasing number of pipeline stages.
Interestingly, a minimum in energy consumption is observed for the
three-stage pipeline implementation.

^2 These implementations have been separately optimized for runtime and in order to obtain a minimum
in energy consumption.

Further investigations reveal that the two-stage implementation needs
more energy due to logic glitches caused by a significant signal slack.
The three-stage implementation reduces this effect by retiming of
unbalanced signal arrival times in the additional pipeline stage.
Moreover, slower and smaller arithmetic operator implementations are
instantiated in the three-stage implementation due to the more relaxed
timing constraints. The four-stage implementation needs more energy
than the three-stage implementation in the clock circuitry and the
flip-flops for the pipeline registers as well as in the hazard
detection and resolution logic. For the three- and four-stage
implementations, which use a predict-untaken branch scheme, taken
branches result in a branch penalty of 2 and 3 cycles, respectively.
This branch penalty results in a higher energy consumption of the
four-stage implementation due to redundant additional fetch and decode
operations.

For the DVB-T receiver system, the two-stage implementation violates
the given clock cycle constraint of the system environment.
Furthermore, the four-stage architecture requires higher effort for
verification and design due to the hazard detection and resolution
logic. The flat minimum in energy consumption is another reason why the
three-stage implementation has been the final architecture of choice
for the DVB-T A&T application.



# of pipeline stages         2       3       4
norm. area                   100%    103%    120%
norm. crit. path             100%    83%     72%
norm. benchmark runtime      100%    92%     86%
norm. energy                 100%    85%     106%

Table 7.1: Results for Different Pipeline Structures
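Since the runtime is the product of cycle count and critical path, the normalized cycle counts implied by Table 7.1 can be recovered with a short calculation. The sketch below only rearranges the table entries; the dictionary names are arbitrary.

```python
# Normalized values from Table 7.1 (two-stage pipeline = 1.0).
crit_path = {2: 1.00, 3: 0.83, 4: 0.72}
runtime = {2: 1.00, 3: 0.92, 4: 0.86}

# runtime = cycles * critical path  =>  cycles = runtime / critical path
cycles = {n: runtime[n] / crit_path[n] for n in crit_path}
```

The three-stage and four-stage pipelines thus need roughly 11% and 19% more cycles than the two-stage pipeline, which reflects the stall cycles caused by data and control hazards.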








It has to be pointed out that the comparison in Table 7.1 uses the
best implementations w.r.t. energy consumption including all possible
power optimizations of the next subsections for each case. For the sake
of conciseness, the discussion of these optimizations in the following
is restricted to the three-stage implementation.



7.1.2 Iterative Instruction Set Optimization 

The purpose of typical ASIP instruction set optimizations is to enhance 
the computational performance. If the performance goals are reached, 
additional optimizations can be applied in order to increase the energy- 
efficiency of an application. The following two examples illustrate the 
effect of these instruction set optimizations. The first example is an 
optimized instruction performing the saturation of an integer value to 
the number range of a 2’s complement number with programmable bit 
width. The second example is a CORDIC computation in vectoring 
mode, which uses several highly optimized instructions. 



7.1.2.1 Example 1: Saturation 

In the instruction traces of the profiling implementation more than 12%
of the total executed instructions are used for saturation to a power
of 2. This kind of saturation is defined by the following simple
relation

    sat(n, m) = \begin{cases} 2^n - 1 & \text{if } m > 2^n - 1 \\ -2^n & \text{if } m < -2^n \\ m & \text{else} \end{cases}
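The relation above translates directly into a small reference function. This is only a behavioral sketch with Python integers, not the hardware implementation of the saturation unit.

```python
def sat(n, m):
    """Saturate m to the range [-2**n, 2**n - 1] as defined by the relation above."""
    hi = (1 << n) - 1
    lo = -(1 << n)
    if m > hi:
        return hi
    if m < lo:
        return lo
    return m

upper = sat(3, 100)    # clipped to the upper limit
lower = sat(3, -100)   # clipped to the lower limit
inside = sat(3, 5)     # value within the range is passed through
```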

Many commercially available DSPs offer a so-called saturation mode.
This mode is usually restricted to saturate accumulator values with a
long bit width to the shorter data path width of the DSP. However, the
above-mentioned saturation task saturates to any valid bit width from
1 bit to the full data path width. This property is required by the
given DVB-T A&T algorithm in order to guarantee that a certain range
for critical output and intermediate values is not exceeded and
malfunctions due to wrap-around are avoided.

Two implementation alternatives are considered for this task (cf.
Figure 7.2):

• a pure software implementation^3 that uses the basic profiling
instruction set of Table 5.4

• a specialized instruction SAT(bw,Rn), which saturates the register
Rn with a dedicated functional unit in hardware to the minimum
and maximum values of a 2’s complement number with bit width bw

Figure 7.2: Two Implementations for Programmable Saturation (panels: Optimized SW Implementation, showing SATURATE(15,R3); SW Implementation with Profiling Instruction Set)



The specialized SAT instruction is executed in a single instruction cycle 
without increasing the critical path of the implementation, whereas the 
conventional solution needs an average of 14.9 cycles in the benchmark 

The result of the power and energy evaluation is given in Table 7.2. The
average power of the profiling implementation is about the same as the
power of the optimized one. However, the results clearly show that
the optimized implementation is far superior in energy-efficiency and
performance because of the significant cycle count reduction. If
additional spill code had been necessary to avoid overwriting of register
values by the subroutine, the energy consumption of the unoptimized
version would have increased even further due to more instructions and
several RAM accesses.

^For this implementation, a subroutine has been used in order to save program memory. Furthermore,
in order to avoid the computation of the limits, a look-up-table is needed for this implementation.

                                      Unoptimized      ISA with specialized
Implementation                        profiling ISA    SAT instruction

avg. cycles per saturation            14.9             1
avg. power [mW] (only sat.)           9.8              10.9
avg. energy [nJ] per saturation       1.83             0.136
relative avg. energy per saturation   100%             7.4%

Table 7.2: Results of the Saturation Benchmark
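The figures of Table 7.2 can be cross-checked with a short calculation.
The sketch below assumes the simple model E = P * cycles * T_clk; the
clock period is not stated in this section, so the value computed here is
only the period implied by the table under that assumption. The
dictionary keys and function names are illustrative.

```python
# Values copied from Table 7.2
profiling = {"cycles": 14.9, "power_mw": 9.8, "energy_nj": 1.83}
optimized = {"cycles": 1.0, "power_mw": 10.9, "energy_nj": 0.136}


def relative_energy(candidate, baseline):
    """Energy of candidate as a fraction of the baseline energy."""
    return candidate["energy_nj"] / baseline["energy_nj"]


def implied_clock_period_ns(entry):
    """Clock period implied by E = P * cycles * T_clk (an assumption).

    nJ / mW gives microseconds, hence the factor 1000 to obtain ns.
    """
    return entry["energy_nj"] / (entry["power_mw"] * entry["cycles"]) * 1000.0


print(f"relative energy: {relative_energy(optimized, profiling):.1%}")
print(f"implied period (profiling): {implied_clock_period_ns(profiling):.2f} ns")
print(f"implied period (optimized): {implied_clock_period_ns(optimized):.2f} ns")
```

Both rows lead to nearly the same implied clock period, which indicates
that the energy gain stems almost entirely from the cycle count
reduction and not from a lower average power.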

Other implementations of the saturation task are also possible, e.g.
inlining of the code using constants for the limits, calculation of the
max/min values on the fly, etc. However, all of these implementations
require more program memory and more instructions than the optimized
implementation, resulting in a significantly lower energy-efficiency.



7.1.2.2 Example 2: CORDIC 



The DVB-T A&T application requires the CORDIC algorithm in vectoring
mode to calculate the angle between a 2-dimensional vector (x, y)
and the x-axis (atan(y/x)). This CORDIC task requires a significant
amount of runtime in the time-critical tasks of the profiling
implementation and is therefore a candidate for thorough optimization.
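For readers unfamiliar with CORDIC in vectoring mode, the following
Python sketch shows the textbook iteration: the vector is rotated towards
the x-axis using only shifts, additions/subtractions and a small table of
atan(2^-i) values, while the applied angles are accumulated. It is a
generic illustration under the assumption x > 0 with integer inputs, not
the ICORE implementation; the function name, the iteration count and the
fixed-point format are chosen here for illustration.

```python
import math


def cordic_vectoring(x: int, y: int, iterations: int = 16,
                     frac_bits: int = 16) -> float:
    """Approximate the angle of (x, y) to the x-axis in radians (x > 0)."""
    # Precomputed atan(2^-i) table in fixed-point representation
    atan_table = [round(math.atan(2.0 ** -i) * (1 << frac_bits))
                  for i in range(iterations)]
    angle = 0
    for i in range(iterations):
        if y >= 0:
            # Rotate clockwise and add the applied angle
            x, y = x + (y >> i), y - (x >> i)
            angle += atan_table[i]
        else:
            # Rotate counter-clockwise and subtract the applied angle
            x, y = x - (y >> i), y + (x >> i)
            angle -= atan_table[i]
    return angle / (1 << frac_bits)


print(cordic_vectoring(30000, 40000), math.atan2(40000, 30000))
```

Each loop iteration consists of two shifts, two conditional
additions/subtractions and one table access, which is the part of the
computation that specialized instructions can target.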

The time- and power-consuming CORDIC loop body has been implemented
with differently specialized instructions according to Figure 7.3:

• hand-programmed implementation with the profiling instruction
set (denoted Implementation 1 in Figure 7.3 and Table 7.3) without
zero-overhead-loop instruction

• specialized instructions (including zero-overhead loop support)
using special purpose hardware units like shift-and-round^,
conditional addition/subtraction etc. (Implementation 2)

• even more specialized instructions and a second
addition/subtraction unit in the core (Implementation 3)



Figure 7.3: Implementation Alternatives for CORDIC Loop Body
(partially reproduced)

Hardware effort:
  Implementation 1: 1 shifter, 1 adder/subtractor, 1 atan table
  Implementation 2: same as Implementation 1
  Implementation 3: 1 shifter, 2 adder/subtractors, 1 atan table



Table 7.3 lists the results of the CORDIC task evaluation. The
average energy consumption of this task is normalized to the profiling
ISA implementation.



Implementation                         Impl. 1    Impl. 2    Impl. 3

avg. cycles per CORDIC call            663.3      154.8      82.4
relative avg. power                    100%       115%       127%
relative avg. energy per CORDIC call   100%       18.8%      15.8%

Table 7.3: Results of the Different CORDIC Implementations



^The CORDIC task for implementation uses rounding of the LSB after shifting. An alternative to this
algorithm is to use at least one additional fractional bit.
Table 7.3 clearly shows the effect of instruction set specialization on the
average power consumption: due to the parallel execution of operations
in the more specialized implementations, the average power increases.
However, the decrease in runtime of the CORDIC task overcompensates
this increase and results in significant energy savings. Table 7.3 shows
that the best optimized version of the CORDIC consumes about 6.7
times less energy than Implementation 1 (profiling instruction set).



7.1.3 Overall Energy Optimization Results 

Instruction set optimizations typically decrease the runtime of the pro- 
cessor for a given task. In order to continue increasing the energy- 
efficiency of an implementation, additional instruction set optimizations 
can be applied, even if the runtime constraints of a given application are 
already met. Apart from ISA optimizations, additional architectural op- 
timizations can be used that have been described in Section 3.3. 

This subsection summarizes the energy optimization results of the 
ICORE three-stage implementation. The numbers in Figure 7.4 are re- 
lated to the effect of incremental optimizations beginning with an im- 
plementation using the profiling instruction set of Section 5.2.2. These 
optimizations are sorted starting with the least effective optimization 
(logic restructuring) and ending with the most effective ones (clock
gating and instruction set optimization). It is important that clock
gating is introduced before instruction set optimization; otherwise, the
benefit of a longer processor sleep period due to faster processing is
significantly reduced.

Reorganization of logic gates and operators^ is the least efficient
optimization, yielding about 10% in energy reduction, but without
significant increase in design time. Blocking gates^ reduce the energy by
roughly another 10% while increasing the area by less than 1%. The
reduction of the internal instruction ROM toggle activity yields about
20% in energy reduction using automatic optimized encoding^ without
affecting area or design time. This saving depends on the size and
organization of the instruction memory. Application-specific instruction
set optimization^ cuts energy consumption by another 50%, while
increasing the design effort significantly due to manual optimization. The
applicability of this optimization strongly depends on the computational
tasks of the application. Clock gating in combination with the sleep
mode of the core yields a factor of about four in energy reduction,
because the processor for the DVB-T A&T application has long idle
intervals. This value strongly depends on the workload of the ASIP. It
might be argued that the processing power of ICORE is significantly
over-dimensioned for the given application. This is not the case: the
runtime constraints of the application represent tight bounds for the
ICORE tasks that have to be met by this implementation.

^Refer to Subsection 3.3.2 for a description of this energy saving technique.

^This technique has been developed and published in [129] [90] and is based on the tool which is
described in Subsection 6.3.1.

The overall power reduction for the three-stage ICORE implementation
with all the above-mentioned optimizations is about 92%. It has to be 
pointed out that all these optimizations do not compromise the flexi- 
bility and maintainability of this building block for late design changes 
that require the implementation of additional software programmable 
tasks. 
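The overall figure of about 92% follows from the individual savings if
each optimization is assumed to act multiplicatively on the energy left
by the previous ones ("another 10%", "another 50%", "a factor of about
four"). The short Python sketch below makes this arithmetic explicit; the
factors are the approximate values quoted in the text, and the
multiplicative model is an assumption of this illustration.

```python
# (optimization, remaining energy fraction after applying it)
steps = [
    ("logic restructuring", 0.90),            # about 10% reduction
    ("blocking gates", 0.90),                 # roughly another 10%
    ("instruction ROM encoding", 0.80),       # about 20% reduction
    ("instruction set optimization", 0.50),   # another 50%
    ("clock gating with sleep mode", 0.25),   # factor of about four
]


def remaining_energy(steps):
    """Multiply the remaining-energy fractions of all steps."""
    remaining = 1.0
    for _name, factor in steps:
        remaining *= factor
    return remaining


left = remaining_energy(steps)
print(f"remaining energy: {left:.1%}, overall reduction: {1 - left:.1%}")
```

Under this model about 8% of the original energy remains, which matches
the stated overall reduction of about 92%.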

In the following discussion, the implementation of an ASIP accelerator 
for the computationally intensive CORDIC task is explored. This im- 
plementation is similar to the example of Subsection 5.3.7, but in this 
case stripped down to a CORDIC for vectoring mode. This coproces- 
sor has been implemented and connected to the ASIP core. Table 7.4 
shows the area and energy consumption for the different implementa- 
tions (ASIP with/without coprocessor) for the CORDIC task and, addi- 
tionally, for the overall DVB-T A&T benchmark tasks (which include 
several CORDIC evaluations). The overall savings for the complete 
tracking tasks are about 38%. 
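The two normalized energy figures for the accelerator allow a rough
estimate of how much of the benchmark energy is spent in the CORDIC
task. The sketch below assumes that the accelerator changes only the
CORDIC energy and leaves the energy of all other tasks untouched; this
assumption and the function name are introduced here for illustration
and are not part of the original evaluation.

```python
def cordic_energy_share(overall_with_acc: float, cordic_with_acc: float) -> float:
    """Estimate the CORDIC share f of the benchmark energy without accelerator.

    Model (assumption): overall = (1 - f) + f * cordic
    =>  f = (1 - overall) / (1 - cordic)
    """
    return (1.0 - overall_with_acc) / (1.0 - cordic_with_acc)


# Normalized energies with accelerator: 62% overall, 7.8% for the CORDIC task
share = cordic_energy_share(0.62, 0.078)
print(f"estimated CORDIC share of benchmark energy: {share:.1%}")
```

Under this simplified model, roughly two fifths of the benchmark energy
of the ASIP without accelerator would be attributable to the CORDIC
evaluations.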

Nevertheless, the implementation of an accelerator breaks the design 
paradigm of an instruction set oriented ASIP and introduces more het- 
erogeneity into the implementation. Maintainability and reusability of 
this building block become more complicated, which outweighs the ad- 
ditional gain in energy-efficiency for the DVB-T system. Consequently, 
the final ICORE has been implemented using the above-described soft- 
ware CORDIC implementation. 



^Refer to the previous subsection for an example. 
Figure 7.4: Incremental Power Optimization of ICORE



Generally, it can be observed that the energy-efficiency of an
implementation increases if specialization for a given application is
introduced. The above-mentioned instruction set optimization and the
implementation of accelerators represent an incremental modification
resulting in increased energy-efficiency. At the same time, the overall
flexibility and reusability of the processor for unexpected tasks with low
to medium required computational performance is preserved. For
performance-critical computational tasks like the CORDIC computation
mentioned above, specialization with additional instructions or a
coprocessor reduces the flexibility of the implementation: a change of the
underlying algorithm requires a redesign of the hardware in this case. The
latter case is a typical example for the tradeoff between
energy-efficiency and flexibility.

ASIP                          without accelerator   with accelerator

area (ND2 equ.)               52k                   56k
norm. energy (only CORDIC)    100%                  7.8%
norm. energy (overall)        100%                  62%

Table 7.4: Results for ICORE with/without coprocessor



7.2 Case Study II: Linear Algebra Kernels and Eigenvalue Decomposition

The second case study compares an optimized hand-programmable
ASIP to a compiler-friendly, parameterizable general purpose processor
core, both of which use the same application-specific accelerator. It is
obvious that a compiler-programmed processor enables a much shorter
design time at the expense of performance and implementation
efficiency compared to a hand-programmed ASIP. If this
compiler-programmable processor can be parameterized in order to match
the performance requirements of an application, however, this concept is
useful for many applications.

The target application is a set of linear algebra kernels for
communication applications. These kernels include typical complex matrix
and vector/matrix operations (cf. Appendix B.4). The specific benchmark
for this case study is the eigenvalue/eigenvector decomposition (EVD)
of a hermitian matrix^ using a Givens-like decomposition algorithm
(cf. Appendix B.5). Complex linear algebra and matrix decomposition
techniques are needed for various communication applications that
use subspace decomposition, e.g. direction of arrival (DOA) estimation
[219] [151], beamforming [6], adaptive filter processing [105] and
vector quantization [279].

^However, the instruction set of the presented ASIP is not restricted to the EVD but also supports
singular value decomposition of rectangular matrices and the vector-matrix operations mentioned above.

The two different design approaches that are compared in this chapter 
are 



• an optimized ASIP tailored to the given application by using spe- 
cialized instructions and a specialized data path together with a 
dedicated accelerator (“constructive ASIP design methodology”, 
which has been described in Section 5.3.6) 

• a processor core with a fixed instruction set, but a parameterizable 
number of functional units like multipliers, adders and memory 
units together with the same dedicated coprocessor as above (“pure 
library-based ASIP design methodology”) 

Both methodologies take advantage of the concept of a tightly coupled 
coprocessor. For this case study, the dedicated CORDIC coprocessor 
that has already been described in the example of Subsection 5.3.7 is 
used. This coprocessor is able to significantly reduce the computational 
load of both processors by mapping a regular computational part of the 
overall algorithm to dedicated hardware. 

The two implementations are described in detail in the following two 
subsections. Afterwards in Subsection 7.2.3, the evaluation of these 
implementations is presented using the eigenvector and eigenvalue de- 
composition of a 10x10 hermitian matrix as benchmark application. 



7.2.1 Implementation I: Optimized ASIP with Accelerator 

The algorithm for Givens-like eigenvalue and eigenvector decomposition
is described in Appendix B.5. In addition to the trigonometric
functions like sine, cosine, phase and magnitude of a vector, the EVD
requires matrix-matrix multiplications. These multiplications with the
Givens matrix $G_n$ are used to iteratively update both the matrix being
diagonalized, $A_n$, and the matrix containing the approximate right
(left) eigenvectors, $EV_{r(l),n}$. Figure 7.5 depicts this matrix-matrix
multiplication for the right eigenvector matrix update using a 4x4 matrix
with the pivot element (2,4) as an example.

Figure 7.5: Eigenvector Matrix Update



For a real-world implementation, one multiplication in Equations B.12
and B.13 is performed immediately after the Givens matrix $G_n$ is
available, rather than storing the matrices and postponing this
calculation. Because the Givens matrices $G_n$ are identity matrices with
an embedded 2x2 pivot submatrix $Q_n$, the matrix-matrix multiplication
reduces to a multiplication with the matrix $Q_n$.
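The reduction to a multiplication with the 2x2 pivot submatrix can be
illustrated with a small Python sketch. It assumes a right
multiplication EV * G, where G is the identity matrix with a 2x2 block q
embedded at the (zero-based) pivot indices p and r; only the two columns
p and r change, and every row contributes exactly two samples per step.
The function name, the argument layout and the example values of q are
hypothetical; the actual entries of the pivot submatrix follow from the
algorithm in Appendix B.5 and are not reproduced here.

```python
def update_columns(ev, q, p, r):
    """In-place update ev <- ev * G for G = identity with block q at (p, r).

    ev is a list of rows with complex entries; q is a 2x2 nested tuple.
    """
    (q00, q01), (q10, q11) = q
    for row in ev:
        a, b = row[p], row[r]        # the two samples read in this step
        row[p] = a * q00 + b * q10   # new entry in column p
        row[r] = a * q01 + b * q11   # new entry in column r
    return ev


# 4x4 example with pivot element (2,4), i.e. zero-based indices 1 and 3
ev = [[complex(i + 1, j) for j in range(4)] for i in range(4)]
q = ((0.6, 0.8j), (0.8j, 0.6))       # illustrative unitary 2x2 block
updated = update_columns([row[:] for row in ev], q, 1, 3)
```

Compared with a full 4x4 matrix-matrix multiplication, only two columns
are read and written, which is consistent with processing two samples
per computation step.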

In contrast, the update of the matrix $A_n$ can exploit additional arithmetic
simplifications due to the symmetry of the hermitian matrix $A_n$, which
results in a significantly reduced number of arithmetic operations.

A critical issue of high performance signal processing is memory
organization, because memories often represent a bandwidth bottleneck due
to a small number of read and write ports (typ. 1 or 2 write ports are
available). The degree of parallelism in the EVD algorithm is in the
order of the matrix dimension, which theoretically enables a fully
parallel solution with distributed parallel memory blocks or even
registers. Unfortunately, the hardware costs of this solution in terms of
silicon area are proportional to the matrix dimension and are not
justified by the benefit in computational performance of this massively
parallel approach for the given application constraints. In the current
case study, the structure of the matrix updates suggests a dual-port RAM
as main data memory for matrix computations, because 2 samples are
processed in each computation step. This design decision represents a
trade-off between a fully parallel and a scalar implementation.




Figure 7.6: Computational Core for Matrix Updates 



A simple computational core that provides the necessary functionality
for the matrix updates is depicted in Figure 7.6. This core needs
registers for pipelining and for reducing memory accesses by storing
the frequently needed values in a register file. The eigenvector
matrix $EV_n$ is stored in the dual-port RAM. A schedule for the
eigenvector matrix update is depicted in Figure 7.7, which demonstrates
the efficient use of the computational resources and the dual-port memory.
An extended version of the computational structure in Figure 7.6 is used
as part of the vector functional units in the final implementation of this
case study.

This final architecture of the ASIP has been named ICORE-II and
supports both scalar and vector instructions that match the properties of
the application. Figure 7.8 depicts a simplified overview of the
important parts of ICORE-II: the main differences from the scalar ICORE
are the vector functional units and the parallelized data memory. The
vector functional units are tightly coupled to the scalar part of the core
by sharing the general purpose register file in order to save area and for
efficient communication. Furthermore, the decoder generates the
parallelized control information by using an additional vector decoder,
which is in turn controlled by a microcode sequencer. This sequencer
supports multi-cycle instructions and controls the processing for vector
lengths that exceed the available parallelism in the hardware.
Figure 7.7: Schedule for Matrix Updates

Figure 7.8: Simplified Overview of ICORE-II



ICORE-II supports general vector instructions with a programmable
vector length for scaling, dot-products and matrix-matrix additions and
multiplications. Additional application-specific instructions and
addressing modes have been implemented both for the scalar and the
vector part of the implementation addressing the EVD application:

• instructions to support the CORDIC coprocessor

• addressing modes to access row- and column-indexed matrix
elements residing at the memory address base_address +
row_register * dim_register + col_register

• instructions to support the update operation for the matrix A and
the eigenvector matrix EV
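The row- and column-indexed addressing mode listed above corresponds to
a row-major matrix layout. The following Python sketch illustrates the
address computation; it assumes that one matrix element occupies one
address (the real memory organization may differ, e.g. for complex
values), and the function names are chosen for illustration only.

```python
def element_address(base_address: int, row: int, col: int, dim: int) -> int:
    """Address of element (row, col): base + row * dim + col."""
    return base_address + row * dim + col


def column_addresses(base_address: int, col: int, dim: int) -> list:
    """Addresses of all elements of one column of a dim x dim matrix."""
    return [element_address(base_address, row, col, dim) for row in range(dim)]


# Element (2, 3) of a 10x10 matrix stored at base address 0x100
print(hex(element_address(0x100, 2, 3, 10)))
# Column 1 of a 4x4 matrix stored at base address 0
print(column_addresses(0, 1, 4))
```

Walking along a column changes the address by the matrix dimension in
each step, which is why a dedicated addressing mode with a dimension
register avoids explicit multiply and add instructions in the loop body.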



7.2.2 Implementation II: Compiler-Programmed Parameterizable Core with Accelerator

The considered parameterizable processor core of this section has been
named ALICE^ [271]. The architecture of this core, which is depicted in
Figure 7.9, uses 5 pipeline stages (FEtch, DEcode1, DEcode2, EXecute
and WriteBack).

In the stage DE2 the general purpose register file is read and the
register output is routed to the input registers of the functional units
in the EX stage. Due to the fact that the number of functional units is
parameterizable, the necessary bandwidth for the control information has
to be provided by an equally scalable instruction fetch stage. For the
ALICE architecture, the number of fetch lanes has to be a power of two in
order to enable a simple addressing logic. There is one program memory
lane and a lane decoder associated with each fetch lane. Instructions can
be fetched in parallel using the concept of compressed VLIW encoding,
which has been explained in Section 4.3.7. Thus, “no-operation”
instructions do not have to be explicitly coded and the associated program
memory locations can be saved for useful instructions. Figure 7.10 shows
an example program, which takes advantage of the parallelism provided
by ALICE using 4 fetch lanes. The parameters of ALICE that can be
adjusted to the needs of an application are

• the word width of the data path

• the number of parallel instruction memory lanes

• the number of general purpose registers and the number of
read/write ports

• the forwarding configuration: enable/disable forwarding from WB
to EX and from WB to DE2

• the number of functional units including arbitrary accelerators and
memory units

• the branch behavior configuration: enable/disable branch delay
slots

^Architecture for LISA Compiler Environment

Figure 7.9: Overview of ALICE Architecture

The user has to take care to select a reasonable configuration that
considers the mutual dependencies between some of the parameters
mentioned above: e.g. it does not make sense to instantiate several ALUs
without providing sufficient instruction memory bandwidth by
increasing the number of instruction memory lanes appropriately.
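The parameter dependencies can be made concrete with a small
configuration check. The Python sketch below is purely illustrative: the
class AliceConfig, its field names and the second rule (no more ALUs than
fetch lanes) are assumptions made for this example and do not describe
the actual ALICE generator. Only the power-of-two requirement for the
fetch lanes is taken directly from the text.

```python
from dataclasses import dataclass


@dataclass
class AliceConfig:
    """Hypothetical container for the ALICE parameters listed above."""
    word_width: int            # word width of the data path in bits
    fetch_lanes: int           # parallel instruction memory lanes
    registers: int             # number of general purpose registers
    read_ports: int
    write_ports: int
    forward_wb_to_ex: bool
    forward_wb_to_de2: bool
    alus: int
    multipliers: int
    accelerators: int
    branch_delay_slots: bool

    def problems(self) -> list:
        """Return a list of human-readable configuration problems."""
        issues = []
        lanes = self.fetch_lanes
        if lanes < 1 or (lanes & (lanes - 1)) != 0:
            issues.append("number of fetch lanes must be a power of two")
        if self.alus > lanes:
            # Heuristic assumed for this sketch, not an ALICE rule
            issues.append("more ALUs than fetch lanes: "
                          "instruction bandwidth is insufficient")
        return issues


# 4 fetch lanes, 2 ALUs, 2 multipliers, 1 accelerator; other values are hypothetical
config = AliceConfig(word_width=32, fetch_lanes=4, registers=32,
                     read_ports=4, write_ports=2,
                     forward_wb_to_ex=True, forward_wb_to_de2=True,
                     alus=2, multipliers=2, accelerators=1,
                     branch_delay_slots=False)
print(config.problems())
```

A check of this kind could run before hardware generation so that
inconsistent parameter sets are rejected early.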
Figure 7.10: Example ALICE Program in Memory



The question arises whether there is a difference between ALICE and the
concept of a general ASIP. Indeed, there is one important distinction: for
the current case study the instructions of ALICE are fixed, which means
that the user does not have to care about micro-architectural details and
optimizations^10. As a replacement for ISA optimization, ALICE supports
a wide variety of configuration parameters in order to scale the
available architectural parallelism. Furthermore, the use of accelerators
with well-defined hardware/software interfaces^11 in the core makes it
possible to take advantage of the high efficiency of dedicated hardware
for regular arithmetic computations. Finally, ALICE represents an
orthogonal ISA, which can be efficiently targeted by HLL compilers
(cf. [271]) to speed up the design and verification process.

^10 It is obviously still possible to add optimized application-specific instructions to ALICE later on in the
design flow. However, the primary goal of the current case study is to avoid this optimization for ALICE in
order to evaluate the reduction in design time.

^11 The software interface is realized with fixed instructions similar to the concept of the ARM [16] or
SPARC [228] instruction set with reusable accelerator move and control instructions.



7.2.3 Evaluation Results

The two processor cores ICORE-II and ALICE have been implemented
in VHDL and synthesized using a typical 0.18 µm technology. Power
estimation has been performed with Synopsys’ DesignPower using the
toggle information of gate level simulation. The results for these two
implementations with the EVD as benchmark application are depicted
in Table 7.5. In this configuration, ALICE uses 4 parallel fetch lanes,
2 ALUs, 2 multipliers and the same CORDIC coprocessor as ICORE-II
as a special purpose unit.

According to Table 7.5, the generic scalability of ALICE significantly
increases the implementation area compared to the application-specific
ICORE-II implementation.

The more general purpose ALICE processor consumes about one order of
magnitude more energy for the considered benchmark applications than
the application-specific optimized ICORE-II. This is due to the larger
instruction memories for parallel instruction fetch, the significantly
larger register file, as well as the additional logic for instruction
expansion and instruction dispatch. Furthermore, ALICE is deeply
pipelined, requiring many power-intensive pipeline registers and
forwarding paths. This difference in energy consumption is a classical
example for the tradeoff between energy and flexibility of an
implementation. Furthermore, this represents a tradeoff between the
architectural efficiency and the design time: the higher design effort for
ICORE-II results in higher performance, lower area and higher
energy-efficiency. However, this design effort can be reduced by better
tool support: due to the microcode programmability of ICORE-II, the
current LISA hardware generation capability could not be fully exploited,
resulting in an increased design time. For future designs it will be
possible to model the processor differently in order to overcome this
drawback. This approach would have reduced the design time of ICORE-II
to about 5 weeks.

It is interesting that the benchmark runtimes of the two implementations
are similar (0.32 ms versus 0.58 ms). This fact makes the library-based
ALICE approach attractive for applications that require medium to high
computational performance without tight energy and area constraints.

Despite the shorter design time of a library-based ASIP design
approach, the application-specific optimized processor results in a
significantly higher architectural and design efficiency. For applications
with tight performance, energy and area constraints, this design approach
is clearly superior to the library-based design methodology.
                                              ICORE-II             ALICE

crit. path                                    6.8 ns               4.9 ns
max. area (ND2 equ.) A                        34.5k                98.2k
avg. power                                    47.2 mW              261 mW
avg. benchmark runtime T                      0.32 ms              0.58 ms
benchmark energy E                            15.10 µJ             151.4 µJ
norm. area                                    35.1%                100%
norm. power                                   18.1%                100%
norm. benchmark runtime                       55.2%                100%
norm. benchmark energy                        9.97%                100%
design time T_design                          8 weeks (cf. text)   3 weeks^12
normalized architectural efficiency 1/(ATE)   100%                 1.93%
normalized design efficiency 1/(ATE T_design) 100%                 5.16%

Table 7.5: Comparison Between ICORE-II And ALICE
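The normalized efficiency figures of Table 7.5 can be reproduced from
the absolute values with a few lines of Python. The sketch assumes that
the benchmark energy is the product of average power and runtime (which
gives values in µJ) and that both efficiencies are normalized to
ICORE-II; the dictionary keys and function names are illustrative.

```python
# Absolute values taken from Table 7.5
icore2 = {"area_k": 34.5, "power_mw": 47.2, "runtime_ms": 0.32,
          "energy_uj": 15.10, "design_weeks": 8}
alice = {"area_k": 98.2, "power_mw": 261.0, "runtime_ms": 0.58,
         "energy_uj": 151.4, "design_weeks": 3}


def ate(core):
    """Product of area A, runtime T and energy E (lower is better)."""
    return core["area_k"] * core["runtime_ms"] * core["energy_uj"]


def architectural_efficiency(core, reference):
    """1/(ATE) of core, normalized to the reference core."""
    return ate(reference) / ate(core)


def design_efficiency(core, reference):
    """1/(ATE T_design) of core, normalized to the reference core."""
    return (ate(reference) * reference["design_weeks"]) / \
           (ate(core) * core["design_weeks"])


for name, core in (("ICORE-II", icore2), ("ALICE", alice)):
    # mW * ms = µJ, so power times runtime should match the energy row
    print(name, round(core["power_mw"] * core["runtime_ms"], 2), "µJ")

print(f"ALICE 1/(ATE):          {architectural_efficiency(alice, icore2):.2%}")
print(f"ALICE 1/(ATE T_design): {design_efficiency(alice, icore2):.2%}")
```

The shorter design time of ALICE improves its relative standing by a
factor of 8/3, but it remains well below the hand-optimized ASIP in both
metrics.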



For higher computational performance requirements, multiple ALICE 
or multiple ICORE-II instances can be used in order to compute several 
orthogonal EVD tasks in parallel: This resulting design corresponds to 
a multi-processor implementation, which has not been further investi- 
gated in this thesis. 



7.3 Concluding Remarks 

The DVB-T A&T case study in Section 7.1 demonstrates the effect of 
best-practice ASIP design on the computational performance and the 
energy-efficiency of an implementation. The investigated optimization 
techniques include the selection of an appropriate ASIP instruction set 
class, iterative instruction set optimization and further energy optimiza- 
tions using logic reorganization, clock gating and automatic instruction 
encoding for energy minimization in the instruction memory. The re- 
sults of these optimizations show that an energy-efficiency gain of more 
than one order of magnitude can be achieved. 



^12 This design time assumes that ALICE is a predesigned and verified library component with a verified
parameterizable HLL compiler.
The case study in Section 7.2 compares a parameterizable processor
core with a fully optimized ASIP targeting eigenvalue/eigenvector
decomposition of hermitian matrices. The results clearly indicate that
the optimized ASIP is far superior to the parameterizable core
concerning energy-efficiency and implementation area. On the other hand,
the design time of the parameterizable processor with coprocessor is
shorter than that of the fully hand-optimized ASIP. This decrease in
design time can be achieved because the parameterizable processor can be
taken from a processor library together with a suitable parameterizable
compiler. Obviously, the optimum design methodology strongly depends on
the constraints of a given application: for design time critical projects
the library-based approach is superior to the constructive ASIP design
approach. For designs requiring a high energy-efficiency the
constructively designed ASIP is a much better choice.

This case study emphasizes the importance of the HDL generation capa- 
bility of the LISA design environment. In order to exploit these capabil- 
ities, the LISA modeling style of complex processor architectures needs 
to be optimized using design guidelines in analogy to HLL or HDL 
coding guidelines. With such an optimized design style, the design effi- 
ciency of the constructive ASIP design approach can be increased using 
the LISA processor design environment. Obviously, this approach can 
be combined with library-based processor templates, which can be used 
as a starting point for optimization. This methodology combines the 
advantages of the constructive with the library-based ASIP design ap- 
proach. 




Chapter 8 



Summary 

Today’s ever-increasing complexities of embedded systems together 
with tightening time-to-market constraints are the primary drivers for 
new enabling technologies to enhance the design productivity. State of 
the art applications require more and more flexibility and functional- 
ity of embedded devices, which favors programmable implementations 
over dedicated hardware. For many handheld appliances like mobile 
phones and organizers, the battery runtime is almost as important as new 
functionalities. In previous publications [1] [92] it was demonstrated 
that high flexibility as well as high performance on the one hand, and 
high energy-efficiency on the other hand are competing goals. This fact 
motivates the exploration of new implementation paradigms that make it
possible to trade off these parameters to optimally satisfy the
requirements of an application.

In this thesis, the ability of application-specific instruction set
processors (ASIPs) to smoothly trade off computational performance and
flexibility for energy-efficiency is demonstrated. ASIPs are instruction set
oriented processors with user-defined instructions, a user-defined data 
path and, optionally, a more dedicated user-defined accelerator. It is 
shown that higher performance and increased energy-efficiency can be 
obtained by exploiting application-specific optimization of the user- 
defined parts of the ASIP. This specialization removes the upper compu- 
tational performance bound of traditional fixed processor architectures 
by introducing the architecture as an additional degree of freedom in the 
design flow. This enables ASIPs to bridge the performance and energy- 
efficiency gap between inflexible dedicated hardware and general purpose
processors. The quantitative evaluation of a case study shows an
ASIP ATE-efficiency^ that is more than one order of magnitude better
than the ATE-efficiency of a general purpose processor.

^This means the equally weighted efficiency for area, time (delay), and energy.

A major obstacle of ASIP design is the larger design space compared
to pure hardware or pure software implementations, often resulting in a
considerably longer design time, which is incompatible with short
time-to-market constraints. This issue is the primary motivation of this
thesis to identify the time-critical tasks in the ASIP design flow and to
develop a design methodology with the goal to speed up these tasks.

The contribution of this thesis can be subdivided into the following two 
tightly related topics: 

• enhanced ASIP design flow to obtain a competitive time-to-market 
(optimum implementation efficiency) 

• ASIP design optimization for performance and low energy consumption
(optimum architectural efficiency)

One important challenge of ASIP design is the huge design space, 
which needs to be explored by the designer in order to obtain an op- 
timum implementation. This thesis proposes an ASIP design flow that 
reduces the design time by using the high level design entry language 
LISA. LISA has been developed at the Institute for Signal Processing 
Systems (ISS) together with tools to automate the generation of a com- 
plete software design tool chain for a given processor architecture. A 
LISA description uses a C-based abstraction level concerning the be- 
havior of single LISA operations paired with the concept of concurrency 
between different operations. This high level description for ASIPs
allows design reuse for many product cycles and increases the design
productivity.

The design approach proposed in this thesis requires a synthesizable 
hardware description in the design iteration loop in order to track the 
impact of high-level decisions like instruction set modifications on im- 
portant low-level implementation parameters. Examples for these low- 
level parameters are the critical path, the energy consumption and the 
silicon area of the ASIP hardware. For this purpose, automatic hard- 
ware description generation is needed in order to reduce the time for 
one design iteration. This thesis contributes essential concepts for this 
new automatic hardware description generation tool by providing hand- 
optimized processor cores as case studies and references. Furthermore, 
critical design decisions for performance and energy consumption are 
quantitatively identified. Additionally, a concept to ease the
development of ASIPs using tool-based automatic instruction encoding is
developed. Finally, a methodology for the tedious verification of the final
hardware description is presented and a semi-automatic test program 
generator supporting this approach is described. The proposed design 
approach is quantitatively evaluated with several case studies, and its 
efficiency is compared to an alternative library-only-based ASIP design 
flow without application-specific optimizations. The results of this case 
study clearly demonstrate that the proposed iterative design approach 
enables a competitive time-to-market. 

Architectural ASIP design optimizations are of paramount impor- 
tance in order to satisfy the constraints of an application. In this thesis, 
the design space for ASIP architectures is explicitly defined, thus, pro- 
viding the basis for any hardware design decision in the ASIP design 
flow. Moreover, architectural modifications are classified w.r.t. the im- 
pact on performance, energy consumption and silicon area in order to 
provide a sound basis for these critical design decisions. The rationale 
of increasing the energy-efficiency using typical ASIP specializations is 
explained in detail. Additional ASIP-typical energy optimizations are 
implemented and integrated in the design methodology, which substan- 
tially improve the total energy-efficiency by about one order of magni- 
tude. 

Certain other topics related to ASIP design methodology are beyond the 
scope of this thesis and are interesting for further research. The devel- 
opment of additional tools to automate certain ever-recurring ASIP op- 
timizations like instruction set specialization is critical to further reduce 
the design time for ASIPs. Furthermore, the quality of an automatically 
generated hardware description and of automatically generated compil- 
ers are essential factors for the efficiency of the final implementation 
and the success of this high-level design approach.

ASIP Development Using LISA 2.0



In this appendix, the language LISA 2.0, which is the basis of a uni- 
fied approach for all phases of Application Specific Instruction Set Pro- 
cessor (ASIP) design, is presented. These phases include architecture 
exploration, architecture implementation, software tools design and ar- 
chitecture integration. The work presented is the result of research at 
the Institute for Integrated Signal Processing Systems (ISS), Aachen 
University of Technology, headed by Prof. Heinrich Meyr, Prof. Gerd 
Ascheid and Prof. Rainer Leupers. This appendix reflects the current 
research status (October 2003), while major research work is ongoing 
in the field of compiler generation, Register Transfer Level (RTL)
processor synthesis and system integration. The technology developed is
commercialized by CoWare Inc. [61]. 



A.1 The LISA 2.0 Language

The open language LISA 2.0 [110][109] is aimed at the formalized 
description of programmable architectures, their peripherals and in- 
terfaces. It was developed to close the gap between purely structure- 
oriented languages (VHDL, Verilog) and instruction set languages for 
architecture exploration purposes. LISA provides high flexibility
to describe the instruction sets of various processor types, such as
SIMD, MIMD and VLIW-type architectures. Processors with complex
pipelines or multi-threading can easily be modelled, too.

Furthermore, LISA models may cover a wide range of abstraction lev- 
els. This comprises all levels starting at a pure functional abstraction 
modelling the data path of the architecture, to a model including the 
pipeline and functional units. In the domain of timing, the abstraction
can range from an instruction-accurate level to a cycle-accurate or even
phase-accurate level. A working set of software tools can be generated
from all levels of abstraction. Moreover, cycle-accurate models can be
used to generate an RTL representation of the architecture.

LISA architecture descriptions are composed of two main components: 
the resource definition in the so called RESOURCE section and the LISA 
operation tree consisting of several LISA operations. The RESOURCE 
section is a unique place to declare the resources of the architecture 
such as memories, buses, registers, pipelines and pins. The amount of 
information given in the RESOURCE section depends on the level of 
abstraction the model is dedicated for. For example, a pipeline is not 
specified in an instruction-accurate model.

Figure A.1: Extract of the LISA operation tree



A LISA operation is the atomic element of the LISA operation tree and
bundles several kinds of information. Two main aspects must be
described explicitly by a LISA operation: the behavior and the
instruction set. The behavior is described in the so-called BEHAVIOR
and EXPRESSION sections. While the EXPRESSION section simply returns a
particular value, e.g. a register content, the BEHAVIOR section
contains the state transition functions of the processor architecture.
These state transitions are described in C code. An instruction set is
defined by its assembly syntax and its binary representation. These two
pieces of information are described in the LISA SYNTAX and CODING
sections, respectively. Additionally, a LISA operation may contain an
ACTIVATION section, which describes the timing of the architecture by
defining a chain of LISA operations to be executed.

The LISA operations are organized in a tree-like structure. An example
can be seen in Figure A.1. The behavior, coding and syntax of an
instruction are distributed over several LISA operations. The tree
starts at the root operation, which contains the basic information for
all valid processor instructions. In this example, the separation into
16-bit and 32-bit instructions is a first specialization. Each of those
operations contains the relevant information for its instruction type.
Accordingly, the operations representing the load/store instructions or
arithmetic instructions are further specializations of the
instructions. Specialization is the basic principle in developing a
LISA model. Moreover, as can also be seen in Figure A.1, a LISA
operation is used to represent not only a whole instruction but also a
part of an instruction, such as an opcode, operand or special condition
field. Thus, developing a LISA model results in creating a LISA
operation tree that unifies the complete description of the behavior,
syntax, coding and timing of the target architecture.

The LISA language allows hierarchical models to be described, which
guarantees modularity and reusability. Architecture models can easily
be modified or adapted to new processors, which is the basis of a
successful and fast design space exploration.



A.2 Design Space Exploration

The key factor in designing ASIPs is an efficient design space
exploration phase. The LISA language allows changes to be applied to
the architecture model quickly, as the level of abstraction is higher
than RTL. As shown in Figure A.2, a LISA model of the target
architecture is used to automatically generate software tools such as
a C-compiler, assembler, linker and simulator. These software tools are
used to profile and modify both architecture and application. This
exploration loop is repeated until a sufficient cost/performance ratio
is reached.

Although the higher level of abstraction is the basic reason for the
success of Architecture Description Languages (ADLs), the link to
physical parameters such as chip area, power consumption or clock
speed gets lost. Ignoring physical parameters in the design space
exploration phase leads to suboptimal solutions or long redesign
cycles. It is therefore necessary to combine the high level of
abstraction with physical parameter evaluation during design space
exploration.

To overcome those limitations, as shown in Figure A.2, a complete
hardware model is automatically generated from LISA in order to get a
preliminary estimate of the clock speed, area and power consumption.
The LISA processor design platform takes the gate-level synthesis
results into account during the exploration phase. The LISA model is
used to derive a fully synthesizable model at RTL. In contrast to other
ASIP development approaches, the designer is able to perform this
synthesis flow without being restricted to fixed RTL components.

If the synthesis results of the generated architecture fulfill the given 
physical constraints, then the hardware model can even be used for the 
final architecture implementation. As the datapath is often highly op- 
timized and based on in-house IP, it may be replaced by the designer 
manually. This is shown on the right-hand side of Figure A.2.




Figure A.2: Exploration and Implementation based on LISA

A.3 Design Implementation

The LISA model is used to derive the complete target architecture in
the form of a Hardware Description Language (HDL) model [217] [112].
The supported languages are VHDL, Verilog and SystemC. As described in
Section A.2, the generated model is used for design exploration and
implementation.

The synthesized architecture consists of several entities. The base
entity instantiates one entity which groups all registers, one entity
for all memories and another for the complete pipeline. The pipeline
entity in turn consists of entities representing every pipeline stage
and the intermediate pipeline registers. The pipeline entity also
contains the automatically generated pipeline controller. The entities
representing the pipeline stages instantiate the final level of
hierarchy: the entities for the functional units, such as ALUs, address
generation units, etc. Moreover, the generated decoder is placed inside
the pipeline stage entities.

The elements that constitute the control path are the instruction
decoder and the pipeline controller. The decoder may be distributed
over several pipeline stages and sets the control signals of the
functional units in order to initiate their execution. It also steers
the pipeline controller. The pipeline controller gathers information,
such as signals from the decoder or the status of the processor, and
sends the appropriate control signals, such as stall signals, to the
pipeline registers.

The decoder generation requires information about the target
architecture. Both the RESOURCE section and the LISA operations are
used to derive the decoder and control path. In particular, detailed
information about the instruction coding (CODING section) and the
timing (ACTIVATION section) is extracted from the LISA operation tree.

A single LISA operation is assigned to a dedicated pipeline stage. The
behavior of a software instruction, for example the instruction add,
is distributed over several different LISA operations as shown in
Figure A.3:



• The operation decode is assigned to the DE stage. This operation
loads the operands from a general purpose register into a pipeline
register.

• The operation addition is assigned to the EX stage. This operation
adds the values in the pipeline registers and writes the result back
to another pipeline register.

• The operation writeback is assigned to the WB stage. This operation
writes the value from the pipeline register to the general purpose
registers.
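
As a rough illustration of this distribution, the following C sketch
models the three operations as functions that communicate only through
pipeline registers. The structure layout, the register names and the
driver function execute_add() are assumptions made for this example;
they are taken neither from a LISA model nor from the generated HDL.

```c
#include <stdint.h>

/* Hypothetical architectural state: register file and pipeline registers. */
typedef struct {
    int32_t gpr[16];       /* general purpose registers                  */
    int32_t de_ex_a;       /* pipeline register DE/EX: first operand     */
    int32_t de_ex_b;       /* pipeline register DE/EX: second operand    */
    int32_t ex_wb_result;  /* pipeline register EX/WB: result            */
} cpu_state;

/* DE stage: load the operands from the register file into pipeline registers. */
void op_decode(cpu_state *s, unsigned src1, unsigned src2) {
    s->de_ex_a = s->gpr[src1];
    s->de_ex_b = s->gpr[src2];
}

/* EX stage: add the pipeline register values, write the next pipeline register. */
void op_addition(cpu_state *s) {
    s->ex_wb_result = s->de_ex_a + s->de_ex_b;
}

/* WB stage: write the pipeline register value back to the register file. */
void op_writeback(cpu_state *s, unsigned dst) {
    s->gpr[dst] = s->ex_wb_result;
}

/* Passes one add instruction through the three stages in order. In a real
 * pipeline each call would take place in a different clock cycle. */
void execute_add(cpu_state *s, unsigned dst, unsigned src1, unsigned src2) {
    op_decode(s, src1, src2);  /* cycle n     */
    op_addition(s);            /* cycle n + 1 */
    op_writeback(s, dst);      /* cycle n + 2 */
}
```

The sketch ignores overlapping instructions; it only shows that no
single function contains the complete behavior of add.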



The operation execution depends on the LISA timing model. As
indicated by the arrows on the left side of Figure A.3, LISA operations
activate other LISA operations, which are executed according to their
spatial delay in the pipeline. These activation sequences are
translated to control signals in the HDL model, which are set or reset
depending on the instruction coding of the respective LISA operation.

Decoders are generated in each stage in which activation signals
start. Thus, the timing of the architecture is reproduced in the HDL
model, and the designer can influence the resulting hardware directly
via the LISA model. In this example, two decoders are generated, one in
the DE stage and another in the EX stage. If the activation sequence is
changed in such a manner that the decode operation activates all other
LISA operations, only a single decoder in the DE pipeline stage will
be generated.




Figure A.3: LISA operation tree and decoder generation

A.4 Software Tools Generation

The software tools generated from the LISA description are able to
cope with the requirements of complex application development. A
C-compiler, assembler, linker and simulator are generated from LISA.
The automatic C-compiler generation is currently one of the major
research topics.



A.4.1 Compiler Generation 

The shift from assembly to the C programming language for application
development is ongoing. This move is driven by the fact that most DSP
algorithms are realized using the C language. Considering the various
configurations of an ASIP, even during design space exploration, the
automatic retargeting of a compiler is highly desirable [267]. For this
reason, the automatic generation of a C-compiler has recently come
strongly into focus.

For retargeting a compiler, the architecture-specific backend of the
compiler must be adjusted or rewritten, whereas the
architecture-independent frontend and most of the optimizations are
kept unchanged. Therefore, a retargetable compiler platform is employed
which reads in a set of description files generated from the LISA
description to build the compiler.

All relevant information for compiler generation is derived from the
LISA model. While some information is explicit in the LISA model
(e.g. via resource declarations), other relevant information (e.g.
concerning instruction scheduling) is only implicit and needs to be
extracted by special algorithms. Some further, heavily
compiler-specific information is not present in the LISA model at all,
e.g. C type bit widths. Thus, compiler information is automatically
extracted from LISA whenever possible, while GUI-based user interaction
is employed for other compiler components. The GUI reads the LISA model
and presents to the user all relevant machine features (e.g. resources
and machine operations) for which interaction is required for further
refinement.

The compiler backend basically consists of a register allocator,
instruction selector, scheduler and code emitter. Apart from that, the
calling conventions and the stack layout have to be configured. The GUI
guides the designer through the specification of the different
components:

• Purely numerical parameters (such as C type bit widths, type align- 
ments, minimum addressable memory unit size) are directly cap- 
tured by means of GUI tables. 

• Calling conventions (i.e. how arguments are passed/returned
to/from functions) are also captured with GUI tables.

• For the supported stack layout, the designer has to specify the stack 
pointer and the frame pointer whereas other configuration items 
can be simply selected/deselected. 

• Retargeting the register allocator is reduced to the selection of al- 
locatable registers out of the set of all available registers in the 
LISA model. 

• The scheduler and the code emitter are generated fully
automatically [270] [268] [269].

• The code selector rules are specified by means of a convenient
drag-and-drop mechanism: the user can compose these rules from
the compiler operators (e.g. addition). As in most compilers,
these mapping rules are the basis for the code selector, which is
based on tree pattern matching. The link between mapping rules and
their arguments on the one hand and LISA operations and their
operands on the other hand is made via drag-and-drop in the GUI.

Once everything is specified, the designer can finally build the
compiler within minutes.



A.4.2 Assembler and Linker Generation 

The generated assembler [111] processes the assembly application and 
produces object code for the target architecture. An automatically gen- 
erated assembler is required, as the modelled architecture consists of 
a specialized instruction set. Certainly, the common assembler features 
are also supported in the generated assembler. For example, many GNU
assembler directives are supported. A comfortable macro assembler
exists to provide more flexibility to the designer.

The different pieces of object code are linked by the automatically
generated linker. The object code is used to create the final
executable according to the modelled memory configuration. Various
configuration options are provided to steer the linking process.



A.4.3 Simulator Generation 

The generated simulator is separated into backend and frontend. The
debugger and profiler frontend is shown in Figure A.4. It supports
application debugging, architecture profiling and application
profiling. The screenshot shows some features such as disassembly
view (1) including loop and execution profiling (2), LISA operation ex- 
ecution profiling (3), memory profiling (4) and LISA operation code 
coverage (5). Also, the content of memories (6), resources (7) and reg- 
isters (8) can be viewed and modified. Thus, the designer is able to 
easily debug both the processor model and application. Additionally, 
the necessary profiling information for design space exploration is pro- 
vided. 

The performance of the simulator strongly depends on the abstraction
level of the underlying LISA model and the memory model. Figure A.5
shows the ranges of simulation speed achieved by the simulators
generated from LISA. The results were obtained on a 2000 MHz Athlon PC
with 768 MB RAM running the Red Hat Linux operating system. The
simulation speed of a LISA model written on a high level of
abstraction, both in the domain of timing and of architectural
features, reaches up to 15 Million Instructions Per Second (MIPS).
After increasing the model accuracy by changing the memory to a complex
memory subsystem, the simulation speed drops to 8 MIPS. Changing the
core model to a pipelined and thus cycle-accurate version without
touching the memory model decreases the simulation speed by 10 MIPS.
Finally, when simulating a very detailed model close or equal to the
real hardware behavior, the simulator still achieves a speed of about
0.5 MIPS.

Figure A.4: The simulator and debugger frontend

Figure A.5: Achieved simulation speed

The simulator backend includes a well-defined Application Programming
Interface (API), which can easily be used to connect to any other
simulator frontend. Various simulation techniques [35] are supported,
such as compiled simulation, interpretive simulation and Just-In-Time
Cache Compiled Simulation (JIT-CCS) [193]. These mechanisms are
briefly described below.



A.4.3.1 Interpretive Simulation 

The interpretive simulation technique is a software implementation of 
the underlying decoder of the architecture. For this reason the inter- 
pretive simulation is considered to be a virtual machine performing the 
same operations as the hardware does: fetch, decode and execute the in- 
struction. All simulation steps are performed at runtime, which provides 
the highest possible flexibility. However, the straightforward mapping
of the hardware behavior into a software simulator is the major
disadvantage of the interpretive simulation technique. Compared to the
decoding of the real instructions in hardware, the control flow
requires a significant amount of time in software.
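
The principle can be sketched in a few lines of C. The 16-bit
instruction format, the four opcodes and all names below are
assumptions invented for this example and do not correspond to any
LISA model; the point is only that fetch, decode and execute are all
repeated at simulation runtime for every instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy format: bits 15..12 opcode, 11..8 destination,
 * 7..4 source 1, 3..0 source 2 (or immediate for OP_LDI). */
enum { OP_HALT = 0, OP_LDI = 1, OP_ADD = 2, OP_SUB = 3 };

typedef struct {
    uint16_t pc;
    int32_t  reg[16];
} toy_cpu;

/* Interpretive simulation: each instruction is fetched, decoded and
 * executed every time it is reached. Returns the number of executed
 * instructions (the final HALT included). */
unsigned long simulate_interpretive(toy_cpu *cpu, const uint16_t *prog,
                                    size_t len) {
    unsigned long executed = 0;
    cpu->pc = 0;
    while (cpu->pc < len) {
        uint16_t word = prog[cpu->pc++];       /* fetch   */
        unsigned op  = (word >> 12) & 0xFu;    /* decode  */
        unsigned rd  = (word >> 8) & 0xFu;
        unsigned rs1 = (word >> 4) & 0xFu;
        unsigned rs2 = word & 0xFu;
        executed++;
        switch (op) {                          /* execute */
        case OP_LDI: cpu->reg[rd] = (int32_t)rs2; break;
        case OP_ADD: cpu->reg[rd] = cpu->reg[rs1] + cpu->reg[rs2]; break;
        case OP_SUB: cpu->reg[rd] = cpu->reg[rs1] - cpu->reg[rs2]; break;
        default:     return executed;          /* OP_HALT or unknown */
        }
    }
    return executed;
}
```

Because the program memory is read anew in every step, a program that
changes during simulation poses no problem; the price is the decoding
work in every iteration of the loop.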



A.4.3.2 Compiled Simulation 

Compiled simulation exploits the locality of code in order to reduce
the execution time of the simulation compared to the interpretive
simulation technique. The task of fetching and decoding an instruction
is performed once before the simulation run. The decoding results are
stored and used later on during simulation. Execution time is saved as, during
the following executions of the same instruction, the fetch and decode 
steps do not need to be repeated. Thus, the compiled simulation re- 
quires the program memory content to be fixed before simulation run- 
time. Various scenarios are unsupported by the compiled simulation 
technique, such as system simulations with external and thus unknown 
memory content and operating systems with changing program memory 
content. Additionally, large applications, which require a huge amount 
of memory on the target host, are hard to support. 
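
A minimal sketch of this two-step scheme is given below. It uses an
invented 16-bit toy instruction format, and all type and function
names are hypothetical. It only mirrors the description above
(decoding once before the run); an actual compiled simulator may go
further and translate the program into host code.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical toy format: bits 15..12 opcode, 11..8 destination,
 * 7..4 source 1, 3..0 source 2 (or immediate for OP_LDI). */
enum { OP_HALT = 0, OP_LDI = 1, OP_ADD = 2, OP_SUB = 3 };

/* Stored decoding result of one instruction. */
typedef struct { uint8_t op, rd, rs1, rs2; } decoded_insn;

typedef struct { uint16_t pc; int32_t reg[16]; } toy_cpu;

/* Step 1, before the simulation run: decode the complete program once.
 * The table grows with the program size. Caller frees the result. */
decoded_insn *predecode(const uint16_t *prog, size_t len) {
    decoded_insn *table = malloc(len * sizeof *table);
    if (table == NULL)
        return NULL;
    for (size_t i = 0; i < len; i++) {
        table[i].op  = (uint8_t)((prog[i] >> 12) & 0xFu);
        table[i].rd  = (uint8_t)((prog[i] >> 8) & 0xFu);
        table[i].rs1 = (uint8_t)((prog[i] >> 4) & 0xFu);
        table[i].rs2 = (uint8_t)(prog[i] & 0xFu);
    }
    return table;
}

/* Step 2, simulation run: only the execute step is left in the loop. */
unsigned long simulate_compiled(toy_cpu *cpu, const decoded_insn *table,
                                size_t len) {
    unsigned long executed = 0;
    cpu->pc = 0;
    while (cpu->pc < len) {
        const decoded_insn *d = &table[cpu->pc++];
        executed++;
        switch (d->op) {
        case OP_LDI: cpu->reg[d->rd] = d->rs2; break;
        case OP_ADD: cpu->reg[d->rd] = cpu->reg[d->rs1] + cpu->reg[d->rs2]; break;
        case OP_SUB: cpu->reg[d->rd] = cpu->reg[d->rs1] - cpu->reg[d->rs2]; break;
        default:     return executed;   /* OP_HALT or unknown */
        }
    }
    return executed;
}
```

Both restrictions named above are visible in the sketch: the table is
built from a fixed program image, and its size is proportional to the
size of the application.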



A.4.3.3 Just-In-Time Cache Compiled Simulation (JIT-CCS) 

Figure A.6: Performance of the just-in-time cache compiled simulation

The objective of the JIT-CCS is to combine the advantages of both
interpretive and compiled simulation. This new technique provides the
full flexibility of the interpretive simulation while reaching the
performance of the compiled simulation. The underlying principle is to
perform the compilation, in fact the decoding process, just-in-time at
simulation runtime. Because of that, full flexibility is provided.
Moreover, the decoding results are stored in a cache. In every
subsequent simulation step, the cache is searched for already existing
decoding results. Due to the locality of code in typical applications,
the simulation speed can be improved using the JIT-CCS. The cache size
used in the JIT-CCS is variable and can be changed in a range from 1 to
32768 lines, where a line corresponds to one decoded instruction. The
maximum number of cache lines corresponds to a memory consumption of
less than 16 MB on the simulator host. Compared to the traditional
compiled simulation technique, where the complete application is
translated before simulation time, this memory consumption is
negligible.
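
The caching principle can be sketched as follows. The toy instruction
format, the direct-mapped cache organization, the tag check against
the raw instruction word and all names are assumptions made for this
example; how the cache of the actual JIT-CCS is organized is not
specified here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy format: bits 15..12 opcode, 11..8 register,
 * 7..4 source 1, 3..0 source 2; OP_JNZ uses bits 7..0 as target. */
enum { OP_HALT = 0, OP_LDI = 1, OP_ADD = 2, OP_SUB = 3, OP_JNZ = 4 };

#define CACHE_LINES 8u  /* the text allows 1 to 32768 lines */

typedef struct {
    int      valid;
    uint16_t addr;      /* program address of the cached instruction   */
    uint16_t word;      /* raw word, to notice changed program memory  */
    uint8_t  op, rd, rs1, rs2;
} cache_line;

typedef struct {
    uint16_t      pc;
    int32_t       reg[16];
    cache_line    cache[CACHE_LINES];
    unsigned long hits, misses;
} toy_cpu;

unsigned long simulate_jit(toy_cpu *cpu, const uint16_t *prog, size_t len) {
    unsigned long executed = 0;
    cpu->pc = 0;
    while (cpu->pc < len) {
        uint16_t addr = cpu->pc++;
        uint16_t word = prog[addr];
        cache_line *line = &cpu->cache[addr % CACHE_LINES];
        if (line->valid && line->addr == addr && line->word == word) {
            cpu->hits++;                /* reuse the stored decoding result */
        } else {
            cpu->misses++;              /* decode just-in-time and store it */
            line->valid = 1;
            line->addr  = addr;
            line->word  = word;
            line->op    = (uint8_t)((word >> 12) & 0xFu);
            line->rd    = (uint8_t)((word >> 8) & 0xFu);
            line->rs1   = (uint8_t)((word >> 4) & 0xFu);
            line->rs2   = (uint8_t)(word & 0xFu);
        }
        executed++;
        switch (line->op) {
        case OP_LDI: cpu->reg[line->rd] = line->rs2; break;
        case OP_ADD: cpu->reg[line->rd] = cpu->reg[line->rs1] + cpu->reg[line->rs2]; break;
        case OP_SUB: cpu->reg[line->rd] = cpu->reg[line->rs1] - cpu->reg[line->rs2]; break;
        case OP_JNZ:
            if (cpu->reg[line->rd] != 0)
                cpu->pc = (uint16_t)((line->rs1 << 4) | line->rs2);
            break;
        default:     return executed;   /* OP_HALT or unknown */
        }
    }
    return executed;
}
```

With CACHE_LINES set to 1, consecutive addresses evict each other and
every step decodes again, which corresponds to the interpretive case.
The figures quoted above (32768 lines in less than 16 MB) imply fewer
than 512 bytes per decoded instruction.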

Figure A.6 illustrates the performance of the cache compiled simulation
depending on the cache size. The results were obtained on a 1200 MHz
Athlon PC with 768 MB RAM running the Microsoft Windows 2000 operating
system. A cache size of one line means that the just-in-time cache
compiled simulation essentially performs in the same way as the
interpretive simulation: every instruction is decoded and simulated
again, without exploiting code locality. With a rising number of cache
lines, the simulation performance (wide bars) comes closer to the
performance of a compiled simulation and eventually reaches it. The
simulation speed increases with decreasing cache miss rate (narrow
bars). As can be seen in the figure, the performance of the compiled
simulation can be reached with a relatively small cache size and thus
low memory consumption on the host machine. Moreover, this memory
consumption on the target host is constant relative to the application
size.



A.5 System Integration 

Today, typical single-chip electronic system implementations combine
a mixture of DSPs, micro-controllers, ASICs and memories. In future,
the number of programmable units in a System-on-Chip (SoC) design
will increase even further. To handle the enormous complexity, system
level simulation is absolutely necessary for both performance
evaluation and verification in the system context. The earlier in the
flow design errors or a lack of performance are detected, the lower the
costs of redesign cycles. The automatically generated LISA processor
simulators can be integrated into various system simulation
environments, such as CoWare ConvergenSC [61] or Synopsys CoCentric
System Studio [234]. Thus, modules provided by different design teams
or even third parties can be combined easily.

The communication of the LISA processors with their system environment
can be modelled on different levels of abstraction. First, LISA pin
resources can be mapped directly to the SoC environment for
pin-accurate co-simulation. Alternatively, the LISA bus interface
allows the SoC communication to be modelled on a higher abstraction
level, i.e., Transaction Level Modeling (TLM) [239]. In this way,
accesses to buses and memories external to the respective processor
core are efficiently mapped to the communication primitives applied in
the SoC simulation world.

For user-friendly debugging and online profiling of the embedded SW
and its platform, the user always has the possibility of getting the
full SW-centric view of an arbitrary SW block [274] at simulation
runtime. This is done simply by dynamically connecting a HUB debugger
GUI to the processor of interest. Thus, only one or two debugger GUIs
are sufficient even to debug complex multiprocessor systems. All other
SW blocks that are currently not considered still simulate at maximum
speed. The remote debugger frontend instance offers all observability
and controllability features for multiprocessor simulation that are
known from standalone processor simulation. Even resources external to
a processor module but mapped into its address space, like peripheral
registers and external memories, can be visualized and modified by the
multiprocessor debugger GUI. The SW developer can dynamically connect
to relevant processors, set break/watch points in the respective code
segments, disconnect from the simulation and automatically reconnect
when a breakpoint is hit.



A.6 Summary 

This appendix presents the development of Application Specific
Instruction Set Processors (ASIPs) based on the architecture
description language LISA. This includes design exploration,
implementation, software tools design and system integration. The
software tools, i.e. C-compiler, assembler, linker and simulator, are
generated automatically from a single LISA model of the target
architecture. Given a cycle-accurate LISA model, even the complete
hardware model can be derived for both exploration and implementation
purposes. The generated software tools are powerful enough to be used
in complex application design, as an assembler, a macro assembler and
different simulation techniques are provided. The highly flexible
system integration allows connection to any co-simulation environment
or customer-specific environment. Current research focuses on the
fields of compiler generation, RTL processor synthesis and system
integration.

Computational Kernels



In this chapter, the computational kernels that have been used for il- 
lustration purposes in Chapter 5 and as benchmarks in Chapter 7 are 
described. 



B.1 The CORDIC Algorithm



The CORDIC algorithm* for the vectoring and rotate mode was first
described by Volder [266]. In the vectoring mode the magnitude and
the angle of a given vector are computed, whereas in the rotate mode a 
given vector is rotated by a given angle. 

The CORDIC algorithm uses iterative computations according to the
following equations:

    $X_{i+1} = X_i \mp 2^{-i} Y_i$                      (B.1)

    $Y_{i+1} = Y_i \pm 2^{-i} X_i$                      (B.2)

    $\alpha_{i+1} = \alpha_i \mp \arctan(2^{-i})$       (B.3)

These iterations are valid for $X_i > 0$; otherwise, the vector with a
negative $X_i$ has to be rotated by 180°. The pathological case of a
zero vector has to be treated as an exception.

After N iterations, the magnitude of the resulting vector $(X_N, Y_N)$
has increased compared to the start vector $(X_0, Y_0)$ by a factor
which can be (pre-)computed using the following equation:

    $K_N = \prod_{i=0}^{N-1} \sqrt{1 + 2^{-2i}}$        (B.4)



* CORDIC stands for Coordinate Rotation Digital Computer.

If the length of the resulting vector in the rotate mode has to be
preserved, the scaling factor $1/K_N$ needs to be applied.
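
To make the effect of the scaling factor concrete, the following
sketch runs the iterations of equations B.1 and B.2 in integer
arithmetic and afterwards applies a precomputed $1/K_N$ as a
fixed-point multiplication. The function names, the format with 16
fractional bits and the choice N = 16 (for which $1/K_N$ is
approximately 0.60725) are assumptions made for this example; the
listings in this appendix instead compute the factor in floating
point at runtime.

```c
#include <stdint.h>

/* 1/K_N for N = 16 with 16 fractional bits: round(0.60725 * 65536). */
#define INV_GAIN_Q16 39797

/* Runs n vectoring-mode iterations (Y -> 0) on the vector (x, y) with
 * x > 0 and returns the unscaled X_N, which exceeds the length of the
 * start vector by the factor K_N. Division by a power of two is used
 * instead of a right shift to keep negative values portable. */
int32_t cordic_magnitude_unscaled(int32_t x, int32_t y, int n) {
    for (int i = 0; i < n; i++) {
        int32_t dx = y / (1 << i);
        int32_t dy = x / (1 << i);
        if (y >= 0) { x += dx; y -= dy; }  /* lower signs in B.1 and B.2 */
        else        { x -= dx; y += dy; }  /* upper signs in B.1 and B.2 */
    }
    return x;
}

/* Applies the precomputed scaling factor 1/K_N as a fixed-point multiply. */
int32_t cordic_apply_scale(int32_t x_unscaled) {
    return (int32_t)(((int64_t)x_unscaled * INV_GAIN_Q16) >> 16);
}
```

For the start vector (49152, 65536), whose length is 81920, the
unscaled result is close to 1.6468 times that length, and the scaled
result lies within a few units of 81920; the remaining deviation
stems from the truncating divisions.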

The strategy for the choice of the signs in equations B.1 to B.3 has to
be selected according to the operating mode.

• rotate mode: in order to obtain $\alpha_N \to 0$, $|\alpha_{i+1}|$
has to be smaller than the previous $|\alpha_i|$; therefore, the upper
sign is chosen if $\alpha_i > 0$, else the lower sign.

• vectoring mode: in order to obtain $Y_N \to 0$, $|Y_{i+1}|$ has to be
smaller than the previous $|Y_i|$; therefore, the lower sign is chosen
if $Y_i > 0$, else the upper sign.



Listings B.1 and B.2 are C code implementations of the CORDIC algorithm
for the vectoring and the rotate mode, respectively.



#include <math.h>   /* atan, sqrt, pow, M_PI */

/* Vectoring mode: rotates (*x, y) onto the x axis. On return, *x holds
 * the scaled magnitude and *z the angle in radians, scaled by 2^N. */
void cordic_vect(long *x, long y, long *z, const long N) {
  long x_next, y_next, z_next, delta, i, flag;
  double K;

  *z = 0; K = 1.0;
  if (*x >= 0)
    flag = 1;
  else {                        /* negative x: rotate vector by 180 degrees */
    *x = -(*x); y = -y; flag = -1;
  }

  for (i = 0; i <= N; i++) {    /* "Vectoring" mode y->0 */
    if (y >= 0) delta = -1; else delta = 1;
    x_next = (*x) - delta * (y >> i);
    y_next = y + delta * ((*x) >> i);
    z_next = (long)(*z - delta * (long)((1 << N)
                    * atan(1.0 / ((float)(1 << i))) + 0.5));
    *x = x_next; y = y_next; *z = z_next;
    K = K * 1.0 / sqrt(1 + pow(2.0, -2.0 * i));
  }

  if (flag == -1) {             /* undo the initial 180 degree rotation */
    if (*z < 0) (*z) += (long)((1 << N) * M_PI);
    else        (*z) -= (long)((1 << N) * M_PI);
  }
  *x = (long)(K * (*x));        /* scaling */
}



Listing B.1: CORDIC Implementation for Vectoring Mode









#include <math.h>   /* atan, sqrt, pow, M_PI */
#include <stdio.h>  /* printf */
#include <stdlib.h> /* exit */

/* Rotate mode: rotates (*x, *y) by the angle z (radians, scaled by 2^N). */
void cordic_rot(long *x, long *y, long z, const long N) {
  long x_next, y_next, z_next, delta, i, flag;
  double K;

  K = 1.0;
  /* Map angles outside [-pi/2, pi/2] into this range by a 180 degree turn. */
  if (z < (long)(-(1 << N) * 0.5 * M_PI)) {
    flag = 1;
    z += (long)((1 << N) * M_PI);
  }
  else if (z > (long)((1 << N) * 0.5 * M_PI)) {
    flag = 1;
    z -= (long)((1 << N) * M_PI);
  }
  else
    flag = 0;

  if ((z < -(1 << N) * 0.5 * M_PI) || (z > (1 << N) * 0.5 * M_PI)) {
    printf("\n\nError in CORDIC subroutine: z out of range!\n");
    printf("z = %ld\n", z);
    printf("Bounds: %ld %ld\n", (long)(-(1 << N) * 0.5 * M_PI),
           (long)((1 << N) * 0.5 * M_PI));
    exit(1);
  }

  for (i = 0; i <= N; i++) {    /* "rotate" mode z->0 */
    if (z >= 0)
      delta = 1;
    else
      delta = -1;
    x_next = *x - delta * (*y >> i);
    y_next = *y + delta * (*x >> i);
    z_next = (long)(z - delta * (long)((1 << N)
                    * atan(1.0 / ((float)(1 << i))) + 0.5));
    *x = x_next;
    *y = y_next;
    z = z_next;
    K = K * 1.0 / sqrt(1 + pow(2.0, -2.0 * i));
  }

  if (flag == 1) {              /* undo the initial 180 degree turn */
    *x = -(*x);
    *y = -(*y);
  }
  *x = (long)(K * (*x));        /* scaling */
  *y = (long)(K * (*y));
}



Listing B.2: CORDIC Implementation for Rotate Mode 

B.2 FIR Filter 

FIR filters are important DSP kernels for a variety of applications.
Many commercial DSPs using multiply-accumulate units are optimized for
these FIR kernels. The following equation defines the behavior of an
M-tap FIR filter:

    $y(n) = \sum_{k=0}^{M-1} h_k \, x(n-k)$             (B.5)

where $x(n)$ is the input, $h_k$ are the coefficients and $y(n)$ is the
output of the filter. In Listing B.3 a C code implementation of the FIR
filter is given, which has been taken from the DSPstone benchmark
program suite [285]. This implementation uses explicit memory copy
operations in order to obtain the correct delay for the inputs
$x(n-k)$. Provided that a processor supports modulo addressing [275],
this delay line can also be implemented using a circular buffer in
memory, which approximately halves the number of memory accesses.
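
The circular buffer variant can be sketched in C as follows. The
structure fir_state, the function fir_step() and the reduced filter
length are assumptions made for this example. The index wrap-around
that a DSP with modulo addressing performs in its address generation
unit is written out explicitly here.

```c
#define TAPS 4  /* M, kept small for the sketch; Listing B.3 uses 64 */

typedef struct {
    int delay[TAPS];  /* circular delay line holding x(n), x(n-1), ...  */
    int head;         /* index of the most recent sample x(n)           */
} fir_state;

/* Consumes one input sample and returns
 * y(n) = sum over k = 0..M-1 of h[k] * x(n-k), cf. equation B.5. */
int fir_step(fir_state *s, const int h[TAPS], int x_new) {
    /* Overwrite the oldest sample instead of copying the delay line. */
    s->head = (s->head + 1) % TAPS;
    s->delay[s->head] = x_new;

    int y = 0;
    int idx = s->head;
    for (int k = 0; k < TAPS; k++) {
        y += h[k] * s->delay[idx];
        idx = (idx == 0) ? TAPS - 1 : idx - 1;  /* modulo addressing */
    }
    return y;
}
```

Per tap, the loop reads one coefficient and one sample. The loop of
the copy-based implementation additionally reads and writes one sample
per tap in order to shift the delay line, which is where the factor of
about two comes from.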



B.3 The Fast Fourier Transformation 

The fast Fourier transformation (FFT) [57] is an algorithm to compute
the discrete Fourier transformation (DFT) at reduced computational
cost. The 8192-point radix-2 FFT implementation in Listing B.4 uses the
decimation-in-time algorithm [154]. The complex coefficients are
partially precomputed, which saves memory bandwidth at the expense of
additional arithmetic computations. The function ReverseBits() in
Listing B.4 is needed to reverse the bit order of an integer for
addressing purposes.
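
A bit reversal of this kind can be sketched as shown below. The
function reverse_bits() and its signature are hypothetical and are not
meant to reproduce ReverseBits() from Listing B.4. For an 8192-point
FFT the index width is 13 bits, since 8192 = 2^13.

```c
/* Reverses the order of the lowest num_bits bits of index.
 * Bits above num_bits are discarded. */
unsigned reverse_bits(unsigned index, unsigned num_bits) {
    unsigned reversed = 0;
    for (unsigned b = 0; b < num_bits; b++) {
        reversed = (reversed << 1) | (index & 1u);  /* append lowest bit */
        index >>= 1;                                /* move to next bit  */
    }
    return reversed;
}
```

With 13 bits, index 1 maps to index 4096 and vice versa, so the input
samples of a decimation-in-time FFT are fetched in this permuted order.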



B.4 Vector/Matrix Operations 

The following vector and vector-matrix operations have been considered:

• dot product: z = \sum_{i=0}^{N-1} x_i y_i

• matrix-vector multiply: z = Xy and z = X^T y

• matrix-matrix multiply: Z = XY, Z = XY^T and Z = X^T Y
#define STORAGE_CLASS register
#define TYPE int
#define LENGTH 64

/* fill both arrays with known test data */
void
pin_down(TYPE *px, TYPE *ph, TYPE y)
{
  STORAGE_CLASS TYPE i;

  for (i = 1; i <= LENGTH; i++)
    {
      *px++ = i;
      *ph++ = i;
    }
}

TYPE main()
{
  static TYPE x[LENGTH];
  static TYPE h[LENGTH];
  static TYPE x0 = 100;

  STORAGE_CLASS TYPE i;
  STORAGE_CLASS TYPE *px, *px2;
  STORAGE_CLASS TYPE *ph;
  STORAGE_CLASS TYPE y;

  pin_down(x, h, y);

  ph = &h[LENGTH-1];
  px = &x[LENGTH-1];
  px2 = &x[LENGTH-2];

  // START_PROFILING;
  y = 0;

  for (i = 0; i < LENGTH - 1; i++)
    {
      y += *ph-- * *px;   /* multiply-accumulate */
      *px-- = *px2--;     /* shift the delay line by one sample */
    }

  y += *ph * *px;
  *px = x0;               /* insert the new input sample */
  // END_PROFILING;

  pin_down(x, h, y);

  return ((TYPE) y);
}


Listing B.3: Implementation of 64 tap FIR Filter (including Testbench) 



• basic element-wise arithmetic: Z = X op Y and z = x op y where
“op” is one of +, *

• vector load/store operations (these also support the generation of
regular matrices like the identity matrix using programmable stride
lengths)

• load/store operation of element in row n and column m in NxM
matrix
B.5 Complex EVD using a Jacobi-like Algorithm

According to [134], for the singular value decomposition (SVD) as well
as for the eigenvector/eigenvalue decomposition¹ (EVD) there are two
algorithms that are widely used: Jacobi-like [128] algorithms and
QR-factorization-based algorithms [94]. For this case study, a Jacobi-like
algorithm using modified Givens rotations is used, because of the better
numeric stability and precision compared to the QR methods [63].

In order to decompose a Hermitian NxN matrix A = A_0 into the (real)
eigenvalues and complex eigenvectors, the Givens rotations are used as
follows: after each multiplication according to the following equation,
a pivot element a_{i,j} of the Hermitian matrix A_n is canceled:

A_{n+1} = G_n^H A_n G_n    (B.6)

The matrix G_n is the modified Givens matrix, which is obtained from an
NxN identity matrix into which a 2x2 pivot submatrix Q_{i,j} is embedded
at the positions (i,i), (i,j), (j,i) and (j,j) as follows (i < j):



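G_n = \begin{pmatrix}
  1 &         &        &         &   \\
    & q_{i,i} & \cdots & q_{i,j} &   \\
    & \vdots  & 1      & \vdots  &   \\
    & q_{j,i} & \cdots & q_{j,j} &   \\
    &         &        &         & 1
\end{pmatrix}    (B.7)

This 2x2 submatrix Q_{i,j} is given by

Q_{i,j} = \begin{pmatrix}
  \cos\rho                 & e^{j\theta} \sin\rho \\
  -e^{-j\theta} \sin\rho   & \cos\rho
\end{pmatrix}    (B.8)

where

\theta = \arg(a_{i,j})    (B.9)

and

\tan 2\rho = \frac{2 |a_{i,j}|}{a_{j,j} - a_{i,i}}, \qquad -\frac{\pi}{4} \le \rho \le \frac{\pi}{4}    (B.10)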



¹ The EVD is a special case of the SVD.







Successive Givens rotations have to be performed for all off-diagonal
elements of A, which is commonly referred to as one sweep. It is
typically necessary to perform several sweeps in order to reach a given
precision, because the Givens rotations set pivot elements that have
already been canceled by a previous rotation back to a value different
from zero². The precision reached after M rotations can be monitored
by computing the energy of the off-diagonal elements

E_{off\_diag} = \sum_{row=1}^{N} \sum_{col=row+1}^{N} |a_{row,col}|^2    (B.11)

Once this energy is sufficiently small, the diagonal elements of the
matrix A_M approximate the real eigenvalues \lambda_i.

The associated right and left eigenvectors are given by the orthogonal
matrices EV_r and EV_l, which have normalized rows and columns,
according to

EV_r = \prod_{i=1}^{M} G_i    (B.12)

and

EV_l = \prod_{i=1}^{M} G_i^H    (B.13)

The computations in Equation B.9 and Equation B.10 can be
implemented with a CORDIC processor, which computes the
phase and magnitude of a vector as well as the sin() and cos() functions
of a given angle.
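
The cancellation of a pivot element can be checked with a short worked derivation. The sketch below rests on assumed conventions: the pivot element is written a_{i,j} = |a_{i,j}| e^{j\theta}, the rotation angle is called \rho, and the auxiliary matrices D, B and R are introduced only for this derivation.

```latex
% Restrict A_n to rows/columns i and j (Hermitian, so a_{j,i} = a_{i,j}^{*}):
\begin{pmatrix} a_{i,i} & a_{i,j} \\ a_{i,j}^{*} & a_{j,j} \end{pmatrix}
  = D B D^{H}, \qquad
D = \begin{pmatrix} e^{j\theta} & 0 \\ 0 & 1 \end{pmatrix}, \qquad
B = \begin{pmatrix} a_{i,i} & |a_{i,j}| \\ |a_{i,j}| & a_{j,j} \end{pmatrix}

% The pivot submatrix factors into a real rotation R:
Q_{i,j} = D R D^{H}, \qquad
R = \begin{pmatrix} \cos\rho & \sin\rho \\ -\sin\rho & \cos\rho \end{pmatrix}

% Off-diagonal element after the rotation:
\left( Q_{i,j}^{H} \, D B D^{H} \, Q_{i,j} \right)_{1,2}
  = e^{j\theta} \left( R^{T} B R \right)_{1,2}
  = e^{j\theta} \left[ \tfrac{1}{2} (a_{i,i} - a_{j,j}) \sin 2\rho
      + |a_{i,j}| \cos 2\rho \right]

% Setting this element to zero yields the angle condition:
\tan 2\rho = \frac{2 |a_{i,j}|}{a_{j,j} - a_{i,i}}
```

As a numeric check under the same assumptions, a_{i,i} = a_{j,j} = 1 and a_{i,j} = j give \theta = \pi/2 and \rho = \pi/4, and the rotated 2x2 block becomes diag(0, 2), which are the eigenvalues of that block.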



² However, the magnitude of this value is smaller than the magnitude of the value that has just been
canceled. Therefore, the algorithm converges to a diagonal matrix.
#include <stddef.h> /* NULL */

#define DP 10 // decimal point of fixed point numbers

/* reverse the lowest NumBits bits of index */
unsigned ReverseBits ( unsigned index, unsigned NumBits )
{
  unsigned i, rev;

  for ( i=rev=0; i < NumBits; i++ ) {
    rev = (rev << 1) | (index & 1);
    index >>= 1;
  }
  return rev;
}

void fft_ll (
  int InverseTransform, long *RealIn, long *ImagIn,
  long *RealOut, long *ImagOut )
{
  unsigned i, j, k, n;
  unsigned BlockSize, BlockEnd;
  const unsigned NumSamples = 8192; /* FFT-length */
  const unsigned NumBits = 13; /* No. of bits to store indices */
  long tr, ti; /* temp real, temp imaginary */
  // delta_angle;
  /* precomputed seed values for the coefficient recurrence, one per stage */
  long sm2_arr[13] = {0, 0, -1023, -724, -391, -199, -100, -50,
                      -25, -12, -6, -3, -1};
  long sm2, sm1_arr[13] = {0, -1023, -724, -391, -199, -100, -50, -25,
                           -12, -6, -3, -1, 0};
  long sm1, cm2_arr[13] = {1023, -1023, 0, 724, 946, 1004, 1019, 1022,
                           1023, 1023, 1023, 1023, 1023};
  long cm2, cm1_arr[13] = {-1023, 0, 724, 946, 1004, 1019, 1022, 1023,
                           1023, 1023, 1023, 1023, 1023};
  long cm1, w, ar[3], ai[3], tmp;
  int loopcnt;

  /* copy the input in bit-reversed order and convert to fixed point */
  for ( i=0; i < NumSamples; i++ ) {
    j = ReverseBits ( i, NumBits );
    RealOut[j] = RealIn[i] << DP;
    ImagOut[j] = (ImagIn == NULL) ? 0 : ImagIn[i] << DP;
  }

  // START_PROFILING;
  loopcnt = 0; BlockEnd = 1;
  for ( BlockSize = 2; BlockSize <= NumSamples; BlockSize <<= 1 )
  {
    sm1 = sm1_arr[loopcnt]; sm2 = sm2_arr[loopcnt];
    cm1 = cm1_arr[loopcnt]; cm2 = cm2_arr[loopcnt];
    w = (2 * cm1); loopcnt++;

    for ( i=0; i < NumSamples; i += BlockSize ) {
      ar[2] = cm2; ar[1] = cm1;
      ai[2] = sm2; ai[1] = sm1;

      for ( j=i, n=0; n < BlockEnd; j++, n++ ) {
        /* parentheses are required: ">>" binds weaker than "-" in C */
        ar[0] = ((w*ar[1])>>DP) - ar[2]; ar[2] = ar[1]; ar[1] = ar[0];
        ai[0] = ((w*ai[1])>>DP) - ai[2]; ai[2] = ai[1]; ai[1] = ai[0];
        k = j + BlockEnd;
        tr = ((ar[0]*RealOut[k])>>DP) - ((ai[0]*ImagOut[k])>>DP);
        ti = ((ar[0]*ImagOut[k])>>DP) + ((ai[0]*RealOut[k])>>DP);
        RealOut[k] = (RealOut[j]-tr); ImagOut[k] = (ImagOut[j]-ti);
        RealOut[j] = (RealOut[j]+tr); ImagOut[j] = (ImagOut[j]+ti);
      }
    }
    BlockEnd = BlockSize;
  }
  // END_PROFILING;
}



Listing B.4: Implementation of an 8192 point FFT 




Appendix C 



ICORE Instruction Set 
Architecture 



This chapter is organized as follows: First of all, the ICORE processor
pipeline organization is described and an overview of the important
processor resources is given. Furthermore, the processor instructions
as well as exceptions to the orthogonal instruction execution model are
discussed. A description of the memory and I/O organization as well as
the ICORE approach to instruction coding concludes this chapter.



C.1 Processor Resources



The processor storage entities visible to the programmer (cf. Figure C.1)
are the general purpose register file (8x32 bit registers), the address
registers (4x9 bit), the status register (with less-than and zero flag)
and the predicate registers (4x1 bit, used as storage bits for conditions).
These resources are abbreviated in the following sections according to
table C.1 in order to simplify the notation.

The following instruction descriptions use a C-like notation to specify
the instruction behavior. Example: AREG=IMM means that the immediate
value “IMM” (which is taken from the instruction word) is loaded
into address register “AREG”, where “AREG” denotes one of the
address registers AR0 to AR3.



[Figure: ICORE architecture with the pipeline stages IF, ID and RD/EX/WB]

Figure C.1: ICORE Architecture



C.2 Pipeline Organization

ICORE uses a 3 stage pipeline, which is depicted in figure C.2. In the
first pipeline stage, the instruction word is fetched from program
memory (“FETCH INSTRUCTION”). The address for the program memory
is taken from the program counter PC. The program ROM in this stage
is clocked by the falling clock edge, whereas all the other registers
are clocked by the rising clock edge. Thus, given a certain value for
the PC (e.g. PC=0x100), the instruction (in this case the instruction
at ROM address 0x100) is stored into the fetch register during the next
rising clock edge. This is convenient, because no additional pipeline
delay is introduced by the ROM itself. In the next stage
(“DECODE INSTRUCTION”) the instruction is decoded and internal
control signals are generated and stored in the decode register. These
signals are propagated to all the functional units of the core and control
the behavior of the decoded instruction.
Abbreviation   Resource
-------------  ------------------------------------------------------------------
REG            one general purpose register (R0-R7)
REGS           source register (which is read)
REGD           destination register (which is written)
AREG           one address register (AR0-AR3) for indirect addressing
IMM            immediate value (constant value encoded in the instruction itself)
PEG            one predicate bit (PR0-PR3)
PC             program counter
STACK          stack for subroutine return address
MEM            data memory
IOPORT         input/output register space
HWL_LOOP_CNT   internal register used as loop counter for the zero-overhead loop

Table C.1: Processor Resources and Abbreviations



[Figure: the stages FETCH INSTRUCTION, DECODE INSTRUCTION and EXECUTE
INSTRUCTION are separated by the Program Counter, the Fetch Register and
the Decode Register; the execute stage accesses the Processor States and
the Data Memory and I/O]


Figure C.2: Abstract ICORE Pipeline Organization 



The EXECUTE INSTRUCTION pipeline stage reads and updates the
registers storing the processor states (general purpose registers, status
register, predicate registers). The most important operations in this
pipeline stage are

• register-register operations, reading a general purpose register,
executing an operation, and writing back the result to a general
purpose register (Examples: multiply or add instructions)

• memory-register operations (load operations) and register-memory
operations (store operations) used to transfer data between data
memory and general purpose registers. The same principle is valid
for I/O-register (input operations) and register-I/O operations
(output operations).

The important point for the DSP programmer is that the update
of the processor state is performed in the same cycle. This means that
no pipeline delay has to be considered for this kind of instruction.

Example 1:

    MOVI 0x2f, R0    /* R0=0x2f          */
    MOVI 0x0a, R1    /* R1=0x0a          */
    MOVI 0x0b, R2    /* R2=0x0b          */
    ADD  R1, R2      /* R2=R2+R1         */
    MUL  R0, R2      /* R2=R2*R0=        */
                     /* (0x0a+0x0b)*0x2f */
                     /* =0x3db=987d      */

The result of the “MOVI 0x0b, R2” instruction in R2 is available for
the “ADD R1, R2” without delay. Furthermore, the result of the “ADD
R1, R2” instruction is also available without delay for the multiply
instruction.

ICORE uses a predict-untaken scheme for conditional branches like the
“BGE L1” instruction in the following example.

Example 2:

        CMP  R1, R2      /* set status reg.   */
        BGE  L1          /* if R2>=R1 then L1 */
        MOVI 0x01, R3    /* R3 = 1            */
        MOVI 0x05, R6    /* R6 = 5            */
    L1: MOVI 0x0, R3     /* R3 = 0            */

In this example the program continues without delay with the
instructions after the branch (“MOVI 0x01, R3” and “MOVI 0x05, R6”), if the
branch is not taken. However, there is a delay between the execution of the
branch and the branch target instruction “MOVI 0x0, R3”, if the branch
is taken. In the case of a taken branch, the pipeline, which has already
loaded and decoded both “MOVI” instructions, is flushed (reset to
no-operation instructions (NOP)). Thus, the delay between the execution of
the taken BGE instruction and the “MOVI 0x0, R3” is exactly 2 cycles.

In summary, the programmer of ICORE can write functionally correct
assembly code without having to worry about pipeline delays, because
ICORE either has no delays (as in example 1) or the delay
slots are hidden from the programmer (as in example 2). This
simplifies code development, because the processor behaves according to a
straightforward model. An implementation alternative to avoid pipeline
flushes after branches would have been to execute the branch delay slots
and fill them with useful instructions. The assembly code for this
implementation is much less understandable and, frequently, it is impossible
to fill the delay slot with a useful instruction other than a NOP. For this
reason a very short pipeline for ICORE has been chosen, which
implicitly minimizes the branch penalty.



Mnemonic           Description                                              Behavior
-----------------  -------------------------------------------------------  ----------------------------
R(AREG,REGD)       Read memory at address AREG and store in register REGD   REGD=MEM(AREG)
RPI(AREG,REGD)     Like “R” with post-increment of AREG                     REGD=MEM(AREG++)
W(REGS,AREG)       Save register REGS in memory at address AREG             MEM(AREG)=REGS
WPI(REGS,AREG)     Like “W” but with post-increment of AREG                 MEM(AREG++)=REGS
IN(AREG,REGD)      Read input port addressed by AREG                        REGD=INPORT(AREG)
INPI(AREG,REGD)    Read input port addressed by AREG and increment AREG     REGD=INPORT(AREG++)
OUT(REGS,AREG)     Write output port addressed by AREG                      OUTPORT(AREG)=REGS
OUTPI(REGS,AREG)   Write output port addressed by AREG and increment AREG   OUTPORT(AREG)=REGS; AREG++

Table C.2: Load/Store Instructions 




































Mnemonic           Description                                              Behavior
-----------------  -------------------------------------------------------  ------------
LAI(CON,AREG)      Load address register immediate                          AREG=CON
LAIR0(CON,AREG)    Load address register immediate with displacement in R0  AREG=CON+R0



Table C.3: Address Register Instructions 



C.3 Instruction Summary

The ICORE instruction set can be divided into

• 8 load/store instructions (to load (store) data from (to) the data
memory or the I/O registers) explained in detail in table C.2

• 28 register-register instructions (performing operations on the
general purpose registers, using data values from the general purpose
registers or the immediate field of the instructions) described in
tables C.4 and C.5

• 16 program flow control instructions (like branches, loop
instructions and instructions to wait for external events) given in table
C.6

• 2 address register load instructions explained in table C.3

The column “Behavior” in the tables C.2 to C.6 contains
a C-like description of the instruction behavior. The meaning of the
operators in this column (e.g. “<<” or “&&”) can be looked up in any
C language manual. Refer to the footnotes of table C.5 for
an explanation of the coding for the “CON” field of RBIT, WBIT and
WBITI.
Mnemonic             Description                         Behavior
-------------------  ----------------------------------  -----------------------------------------
ABS(REG,FLG)         Abs. value of REG,                  (REG<0)?FLG=1:FLG=0;
                     store sign in pred.
ADD(REGS,REGD)       Add two registers,                  REGD=REGD+REGS
                     store result in REGD
ADDI(CON,REG)        Add constant value                  REG=REG+CON
ADDSUB0(REGS,REGD)   Cond. addition or subtraction       (R1>=0)?(REGD+=REGS):(REGD-=REGS)
                     dep. on the sign of R1
ADDSUB1(REGS,REGD)   Same as before, but                 (R1>=0)?(REGD-=REGS):(REGD+=REGS)
                     with negated condition
AND(REGS,REGD)       Logical “AND”                       REGD = REGD & REGS
ANDI(CON,REGD)       Logical “AND” with constant         REGD = REGD & CON
SAT(CON,REG)         Saturate REG to range               REG = REG, if in range,
                     from -2^CON to 2^CON - 1            else saturation
COR01                Special CORDIC instruction          (R1>=0)?(R3+=R6):(R3-=R6); R5=R7;
                     (1. instruction for CORDIC loop)    (R1>=0)?(R2+=R4):(R2-=R4);
                                                         (HWL_LOOP==-1)?(R7=R2)
                                                         :(R7=(((R2>>HWL_LOOP)+1)>>1));
                                                         R6=MEM[AR0];
COR2                 Special CORDIC instruction          (R1<0)?(R1+=R5):(R1-=R5);
                     (2. instruction for CORDIC loop)    (HWL_LOOP==-1)?(R4=R1)
                                                         :(R4=(((R1>>HWL_LOOP)+1)>>1));
CMP(REGS,REGD)       Compare registers                   set_status(REGD-REGS)
CMPI(CON,REG)        Compare with immediate value        set_status(REG-CON)
MOV(REGS,REGD)       Move reg.                           REGD=REGS
MOVI(CON,REG)        Move immediate                      REG=CON
MULS(REGS,REGD)      Signed mult. of lower 16 bits       REGD=((signed_16b) REGS
                     of REGS and REGD                    * (signed_16b) REGD)
MULU(REGS,REGD)      Unsigned mult. of lower 16 bits     REGD=((unsigned_16b) REGS
                     of REGS and REGD                    * (unsigned_16b) REGD)


Table C.4: Register-Register Instructions (Part 1/2) 
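
The saturation behavior of SAT is only described informally in the table. A minimal C sketch of the assumed semantics is given below: values are clamped to the range from -2^CON to 2^CON - 1. The function name sat and the restriction of CON to 0..30 are assumptions for illustration and do not describe the ICORE hardware.

```c
#include <stdint.h>

/* Clamp value to [-2^con, 2^con - 1]; con is assumed to be in 0..30. */
int32_t sat(int32_t value, unsigned con)
{
    int32_t hi = (int32_t)((UINT32_C(1) << con) - 1u);  /* 2^con - 1 */
    int32_t lo = -hi - 1;                               /* -2^con    */
    if (value > hi) return hi;
    if (value < lo) return lo;
    return value;
}
```

For example, with CON = 8 the value 300 saturates to 255 and -300 saturates to -256, while 17 passes unchanged.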
Mnemonic             Description                         Behavior
-------------------  ----------------------------------  ---------------------------------------------
NEG(REG)             Negate register                     REG = -REG
SLA(REGS,REGD)       Arith. shift left (shift count      REGD=REGD<<REGS_low5
                     in 5 LSBs of REGS)
SLAI(CON,REG)        Arith. shift left                   REG=REG<<CON_low5
SRA(REGS,REGD)       Arith. shift right                  REGD=REGD>>REGS_low5
SRAI(CON,REG)        Arith. shift right                  REG=REG>>CON
SRA1(REGS,REGD)      CORDIC instr.: (shift by nr+1       (HWL_COUNT<=-1)?(REGD=REGS)
                     bits, where nr=“current hardware    :(REGD=(((REGS>>HWL_COUNT)+1)>>1))
                     loop cnt.”, with rounding)
SRA1I(CON,REG)       CORDIC instr.                       (CON<=-1)?(REG=REG)
                     for shift and round                 :(REG=(((REG>>CON)+1)>>1))
SUB(REGS,REGD)       Subtraction                         REGD=REGD-REGS
SUBI(CON,REG)        Subtraction of const.               REG=REG-CON
RBIT(CON⁴,REG)       Extract bit field in REG            R0=((REG>>CON_right)
                                                         &((1<<CON_length)-1))
WBIT(CON⁵,REG)       Write bit field in R0               R0=(((R0>>(CON_left+1))<<(CON_left+1))
                                                         +((REG&((1<<(CON_left-CON_right+1))-1))
                                                         <<CON_right)+(R0&((1<<CON_right)-1)))
WBITI(CON⁶,REG)      Write constant in                   REG=(((REG>>(CON_left+1))<<(CON_left+1))
                     bit field of REG                    +((31&CON_value&((1<<(CON_left
                                                         -CON_right+1))-1))<<CON_right)
                                                         +(REG&((1<<CON_right)-1)))



Table C.5: Register-Register Instructions (Part 2/2) 



⁴ CON is defined by the relation CON=CON_length*8+CON_right, where CON_length and CON_right
are 3 bit unsigned values

⁵ CON is defined by CON=CON_length*8+CON_right, where CON_length and CON_right are 3 bit
unsigned values; CON_left = CON_right + CON_length - 1

⁶ CON is defined by CON=64*CON_value+8*CON_length+CON_right, where CON_value is a 5 bit field
and CON_length and CON_right are 3 bit unsigned values; CON_left = CON_right + CON_length - 1
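
The coding of the CON field and the RBIT behavior from table C.5 can be illustrated with a few lines of C. This is a sketch based on the footnote definition CON = CON_length*8 + CON_right; the helper names con_right, con_length and rbit are hypothetical.

```c
#include <stdint.h>

/* Split the constant: CON = CON_length*8 + CON_right (3 bits each). */
static unsigned con_right(unsigned con)  { return con & 7u; }
static unsigned con_length(unsigned con) { return (con >> 3) & 7u; }

/* RBIT: extract CON_length bits of reg, starting at bit CON_right. */
uint32_t rbit(uint32_t reg, unsigned con)
{
    uint32_t field_mask = (UINT32_C(1) << con_length(con)) - 1u;
    return (reg >> con_right(con)) & field_mask;
}
```

With CON = 3*8 + 4 = 28, the call rbit(0x5A7, 28) returns bits 6..4 of the register value, which is 2.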
Mnemonic          Description                              Behavior
----------------  ---------------------------------------  ----------------------------------
B(CON)            Uncond. rel. branch                      PC=PC+CON+1
BSR(CON)          Uncond. rel. branch to subroutine        STACK=PC; PC=PC+CON+1;
BE(CON)           Rel. branch if “=” cond.                 if(STATUS.Z==1)
                  in status reg.                             PC=PC+CON+1;
BNE(CON)          Rel. branch if “!=” cond.                if(STATUS.Z==0)
                  in status reg.                             PC=PC+CON+1;
BLT(CON)          Rel. branch if “<” cond.                 if(STATUS.L==1 && STATUS.Z==0)
                  in status reg.                             PC=PC+CON+1
BLE(CON)          Rel. branch if “<=” cond.                if(STATUS.L==1 || STATUS.Z==1)
                  in status register                         PC=PC+CON+1
BGT(CON)          Rel. branch if “>” cond.                 if(STATUS.L==0 && STATUS.Z==0)
                  in status register                         PC=PC+CON+1
BGE(CON)          Rel. branch if “>=” cond.                if(STATUS.L==0 || STATUS.Z==1)
                  in status register                         PC=PC+CON+1
BPC(FLG,CON)      Rel. branch if pred. bit FLG is clear    if(FLG==0) PC=PC+CON+1
BPS(FLG,CON)      Rel. branch if pred. bit FLG is set      if(FLG==1) PC=PC+CON+1
END               Enter idle mode                          -
RTS               Return from subrout.                     PC=STACK
SUSPG             Wait until guard_trig=“1”                -
SUSPP             Wait until ppubus_en=“1”                 -
LPCNT(CON,REG)    Init. loop cnt. reg.                     lp_start_count=CON;
                                                           lp_end_count=REG;
LPINI(CON)        Init. loop start and end addr. (CON)     -
                  and activ. loop processing with
                  next instr.




Table C.6: Program Flow Control Instructions 
C.4 Exceptions to the Hidden Pipeline Model

There are some restrictions for the programmer due to the internal
pipeline of ICORE. Restrictions are only present for the zero-overhead
loop processing, which is implemented by the flow-control unit (cf.
Figure C.1).

• the instruction “LPCNT” has to be executed at least 2 instructions
before the loop starts, when the internal loop counter is needed
within the loop, e.g.

    LPCNT(1, 24);
    MOV(R1,R2);
    LPINI(COR_LOOPEND);
    LABEL(COR_LOOPSTART)
    COR01;
    COR2;
    LABEL(COR_LOOPEND)

is a legal sequence

• the loop body (in the above example the instructions COR01 and
COR2) needs at least 2 instructions. A loop body consisting of only
one instruction is not supported.

• branches to the last loop instruction are illegal, e.g. in

    LPCNT(1, 10);
    MOV(R1,R2);
    LPINI(LOOPEND);
    LABEL(LOOPSTART)
    RPI(AR0,R0);
    CMP(0,R0);
    BNE(END_BODY);
    MOVI(1,R0);
    LABEL(END_BODY)
    WPI(R0,AR1);
    LABEL(LOOPEND)

the “BNE(END_BODY);” instruction is illegal




C.5 ICORE Memory Organization and I/O Space

ICORE uses a Harvard architecture, which means that instruction and 
data memory are separated. An instruction ROM of 2048 words with 
20 bits is used to store the program. 

The data memory is subdivided into a small (synthesized) 24 word 
ROM and a 256x32 bit RAM. Table C.7 shows the address mapping 
of the data memory. The data memory itself contains about 200 tempo- 
rary states and variables which correspond to the states defined in the 
DVB-T A&T specification. 



Memory        Start Address   End Address
------------  --------------  ------------
Synth. ROM    0               23
unused        24              255
RAM           256             511



Table C.7: Data Memory Mapping 

The I/O-address space of ICORE is separated from the data and the 
instruction memory and uses about 40 different registers for I/O values. 



C.6 Instruction Coding 

The instruction code word is the representation of operations and
operands in the instruction memory. For instance, the word
“000100DDDIIIIIIIIIII” is the instruction code word for “MOVI #I, D”
where “I” represents an 11-bit immediate value and D is the
destination (general purpose register R0-R7). Thus, the operation “MOVI
011100b, R5” moves the signed binary value 011100b to the register
R5 and has the machine coding “00010010100000011100”.
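
The MOVI coding above can be reproduced with a small C helper. This is an illustrative sketch only: the field layout (6-bit opcode, 3-bit destination, 11-bit immediate) is read off the code word given in the text, and the names MOVI_OPCODE and encode_movi are hypothetical. ICON may assign different codings during design space exploration.

```c
#include <stdint.h>

#define MOVI_OPCODE 0x04u  /* 000100b, taken from the example code word */

/* Pack "MOVI imm, Rdest" into a 20-bit instruction word:
 * bits 19..14 opcode, bits 13..11 destination, bits 10..0 immediate. */
uint32_t encode_movi(int imm, unsigned dest)
{
    uint32_t imm_field = (uint32_t)imm & 0x7FFu;  /* 11-bit two's complement */
    return (MOVI_OPCODE << 14) | ((dest & 0x7u) << 11) | imm_field;
}
```

The call encode_movi(28, 5) yields 0x1281C, which is the binary word 00010010100000011100 quoted above.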

In order to simplify the design space exploration, which involves
frequent changes of the instruction coding, a tool called ICON has been
developed for programming. This tool has been described in
Subsection 6.3.1 in this thesis. ICON is an instruction coding generator, a
hardware generator for the decoder hardware description, and an
assembler. ICON uses the assembler program in a line-oriented input file
and automatically generates the instruction coding, the instruction
decoder and the machine code (as COFF file for the ROM). The user has
the freedom to select the preferred coding scheme for the opcodes and
the alignment of operand fields. The remaining degrees of freedom are
optimized to minimize the coding width and the power consumption.
Figure C.3 shows an example input file (“.cri”-file) for ICON, which
contains the assembler program in the following format:

• the first line contains a description of the individual fields of the
file, starting with the opcode and the individual operands. The
second field, for instance, specifies the field “reg0” (first general
purpose register), which is encoded as 3 bit unsigned value (“3u”);
the fifth field specifies the 11-bit signed immediate value

• the second line is a blank line

• the following lines specify the actual program to be implemented,
starting with the operation, e.g. “movi_op”, and the appropriate
fields, e.g. “u3reg0=5” and “s11immediate=28” for the above
example (“MOVI 011100b, R5”). “x” values indicate don’t care
operand values.



opcode   | u3r0 | u3r1 | u2ar | s11imm | u2pr

br_op    | x    | x    | x    | 6      | x
movi_op  | 0    | x    | x    | 0      | x
lai_op   | x    | x    | 0    | 32     | x
lpc_op   | 1    | x    | x    | 1      | x
rpi_op   | 2    | x    | 1    | x      | x
cmp_op   | 4    | 2    | x    | x      | x
end_op   | x    | x    | x    | x      | x

Figure C.3: Assembler Input File for ICON 




Appendix D 



Different ICORE Pipeline 
Organizations 



Figure D.1 depicts the different pipeline organizations that have been
explored during the design of ICORE. Figure D.1 neglects many details
of the implementation including the status and predicate register file,
the address generator etc. It rather depicts the part of the data path that
is needed for register-register and memory-register instructions.
Implementation a) shows a two stage pipeline: the critical path of this
implementation is typically the ID/RD/EX/WB stage. Implementation
b) reduces this critical path by inserting an additional pipeline stage
after the decoder. This also increases the branch penalty by one cycle.
Implementation c), finally, uses a 4-stage pipeline with a data
forwarding path. This implementation, which is similar to many
conventional RISC processor implementations, needs additional multiplexers
in the RD/EX stage to implement the forwarding logic. For the ICORE
benchmark, implementation b) exhibited the best energy efficiency and
was able to meet the given timing constraints.






c) IF ID RD/EX WB 




Figure D.l: Different Pipeline Organizations for Design Exploration 
















Appendix E 



ICORE HDL Description 
Templates 



In this chapter generic examples for HDL templates that implement reg- 
ister file instances and functional units are described. These descriptions 
can also be used and parameterized by an automatic HDL generator. 



E.1 Generic Register File Entity

The implementation of a processor register like a status or a general
purpose register file can be achieved using the template register description
in Listing E.2 together with a parameterization package according to
Listing E.1. This synthesizable register file can be parameterized to
match any register structure that can be described with LISA.

Parameters of this register file template are the bit width of register
elements, the number of registers, and the number of individual read and
write ports of the register. If two write ports try to write to the same
address, this write access conflict is resolved in hardware using a
prioritization scheme (here, write ports with lower number have higher priority).
Optionally, hardware or simulation code to detect this condition can be
added (which has been omitted in Listing E.2 due to a lack of space).
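
The write-port prioritization can be illustrated with a small behavioral model in C. This is only a sketch of the rule stated above (a lower port number has higher priority), not a translation of the VHDL template; the type write_port, the function regfile_write and the constants NUM_REGS and NUM_WRITE_PORTS are hypothetical.

```c
#include <stdint.h>

#define NUM_REGS 8
#define NUM_WRITE_PORTS 2

typedef struct {
    int enable;       /* 1 if this port writes in the current cycle */
    unsigned regnr;   /* destination register index */
    uint32_t data;    /* value to be written */
} write_port;

/* Apply all write ports for one clock cycle. */
void regfile_write(uint32_t regs[NUM_REGS],
                   const write_port ports[NUM_WRITE_PORTS])
{
    /* Visit the ports from the highest to the lowest index, so that
     * port 0 is applied last and therefore wins a write conflict. */
    for (int p = NUM_WRITE_PORTS - 1; p >= 0; p--) {
        if (ports[p].enable && ports[p].regnr < NUM_REGS) {
            regs[ports[p].regnr] = ports[p].data;
        }
    }
}
```

If both ports target register 3 in the same cycle, the value of port 0 is stored and the value of port 1 is discarded.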

Write and read accesses to such a register file can be implemented by 
connecting the inputs of the write ports to the associated data sources 
using multiplexers in case of multiple sources. Due to the fact that the 
read and write ports have access to all of the available internal reg- 
isters, the hardware generator has to find an optimum assignment of 
data sources/sinks to register write/read ports. This assignment can be 
achieved using a balancing scheme, which minimizes the total multi- 
plexer area and delay of the implementation. 
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;

PACKAGE xy_reg_defs IS
  -- GENERAL PURPOSE REGISTERS
  CONSTANT xy_reg_width : integer := 32;
  CONSTANT xy_ldn_num_registers : integer := 3;
  -- modify the following, if no power of 2 is required
  CONSTANT xy_num_registers : integer := 2**xy_ldn_num_registers;
  -- number of read and write ports of reg. file
  CONSTANT xy_num_read_ports : integer := 5;
  CONSTANT xy_num_write_ports : integer := 5;

  -- XY REGISTERS INTERFACES
  SUBTYPE xy_register_t IS std_logic_vector(xy_reg_width-1 DOWNTO 0);
  -- type to select registers
  SUBTYPE reg_nr_t IS unsigned(xy_ldn_num_registers-1 DOWNTO 0);
  -- all read ports
  TYPE xy_read_port_array_t IS ARRAY (xy_num_read_ports-1 DOWNTO 0)
    OF xy_register_t;
  -- all write ports
  TYPE xy_write_port_array_t IS ARRAY (xy_num_write_ports-1 DOWNTO 0)
    OF xy_register_t;
  -- selects if write port is active and which register is written
  TYPE xy_write_port_enable_array_t IS ARRAY (xy_num_write_ports-1
    DOWNTO 0) OF std_logic;
  TYPE xy_write_port_nr_array_t IS ARRAY (xy_num_write_ports-1 DOWNTO 0)
    OF reg_nr_t;
  -- selects which register is read
  TYPE xy_read_port_nr_array_t IS ARRAY (xy_num_read_ports-1 DOWNTO 0)
    OF reg_nr_t;
  -- the register file itself
  TYPE xy_register_array_t IS ARRAY (xy_num_registers-1 DOWNTO 0)
    OF xy_register_t;
END xy_reg_defs;



Listing E.1: Package with Definitions for Parameterizable Register File



1 Read Port, 1 Write Port
# of Registers            2       4       8       16      32
Area (ND2 equ. gates)     0.61k   1.09k   2.10k   3.99k   7.79k

1 Write Port, 8 Registers
# of Read Ports           1       2       3       4       5
Area (ND2 equ. gates)     2.10k   2.53k   2.99k   3.44k   3.90k

2 Read Ports, 8 Registers
# of Write Ports          1       2       3       4       5
Area (ND2 equ. gates)     2.53k   3.15k   3.28k   3.79k   3.88k


Table E.l: Area Results for Example Register File Configurations 








































Table E.1 contains synthesis results for a register file with 32 bit
registers and several example configurations for the register number and the
number of read and write ports. Compared to the area of a functional
unit like a 16x16 bit multiplier, which consumes about 1.9k equivalent
gates (target frequency: 200MHz), the register file area can be
significant, if many registers are needed. From the power perspective, registers
also consume a significant part of the total power, because they are
frequently accessed in a typical load/store architecture (cf. Appendix F).



E.2 Generic Bit-Manipulation Unit 

This section describes an application-specific functional unit with the
purpose to read and write short bit fields within a longer (e.g. 32-bit)
word. This so-called bit-manipulation unit is an example for a
hand-optimized VHDL description of a functional unit, because this kind of
operation is not available in the DesignWare library of Synopsys [233].

As an example for a bit field write operation refer to Section C.3, where
the instructions RBIT, RBITI, WBIT and WBITI are described. For
instance, in case of a WBIT instruction, the bit-manipulation unit (cf.
VHDL Listings E.5, E.3 and E.4) simply replaces the bits CON_right
to CON_right + CON_length - 1 with the CON_length LSBs of
op1_in_nb.
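
The bit field replacement performed for a WBIT instruction can be written compactly in C. The following sketch models only the data operation described above. The function name wbit and its parameter names are hypothetical, and the field is assumed to lie completely within a 32-bit word with a length below 32.

```c
#include <stdint.h>

/* Replace bits right .. right+length-1 of target with the length
 * least significant bits of source; all other bits are kept. */
uint32_t wbit(uint32_t target, uint32_t source,
              unsigned right, unsigned length)
{
    uint32_t field = (UINT32_C(1) << length) - 1u;  /* length ones */
    uint32_t mask = field << right;                 /* field position */
    return (target & ~mask) | ((source & field) << right);
}
```

For example, wbit(0xFFFFFFFF, 0x2, 4, 3) clears bits 6..4 and inserts the value 2 there, which gives 0xFFFFFFAF.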







LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;
-- assumption: the package of Listing E.1 is compiled into library work
USE work.xy_reg_defs.ALL;

ENTITY xy_reg_file IS
  PORT (
    clk_sysd2, rstq      : IN  std_logic;
    xy_write_port_enable : IN  xy_write_port_enable_array_t;
    xy_write_port_regnr  : IN  xy_write_port_nr_array_t;
    xy_data_in           : IN  xy_write_port_array_t;
    xy_regnr_read        : IN  xy_read_port_nr_array_t;
    xy_data_out          : OUT xy_read_port_array_t;
    xy_register          : OUT xy_register_array_t);
END xy_reg_file;

ARCHITECTURE rtl OF xy_reg_file IS
  SIGNAL gpreg : xy_register_array_t;
BEGIN
  xy_register <= gpreg;

  gp_register_write : PROCESS (rstq, clk_sysd2)
    VARIABLE enable_bus : std_logic_vector(xy_num_registers-1
                                           DOWNTO 0);
    VARIABLE tmp_gpreg : xy_register_t;
  BEGIN
    IF (rstq = '0') THEN
      gpreg <= (OTHERS => (OTHERS => '0'));
    ELSIF (clk_sysd2'event AND clk_sysd2 = '1') THEN
      -- mark every register that is written by at least one port
      enable_bus := (OTHERS => '0');
      FOR i IN xy_num_write_ports-1 DOWNTO 0 LOOP
        IF (xy_write_port_enable(i) = '1') THEN
          enable_bus(conv_integer(xy_write_port_regnr(i))) := '1';
        END IF;
      END LOOP;
      FOR i IN xy_num_registers-1 DOWNTO 0 LOOP
        tmp_gpreg := (OTHERS => '0');
        -- the last assignment wins, so port 0 has the highest priority
        FOR j IN xy_num_write_ports-1 DOWNTO 0 LOOP
          IF (xy_write_port_regnr(j) = i
              AND xy_write_port_enable(j) = '1') THEN
            tmp_gpreg := xy_data_in(j);
          END IF;
        END LOOP;
        IF (enable_bus(i) = '1') THEN
          gpreg(i) <= tmp_gpreg;
        ELSE
          gpreg(i) <= gpreg(i);
        END IF;
      END LOOP;
    END IF;
  END PROCESS;

  gp_register_read : PROCESS (gpreg, xy_regnr_read)
    VARIABLE nr_var : integer RANGE 0 TO xy_num_registers-1;
  BEGIN
    xy_data_out <= (OTHERS => (OTHERS => '0'));
    FOR i IN xy_num_read_ports-1 DOWNTO 0 LOOP
      nr_var := conv_integer(xy_regnr_read(i));
      xy_data_out(i) <= gpreg(nr_var);
    END LOOP;
  END PROCESS;

  -- insert write collision detection here
END rtl;



Listing E.2: Synthesizable VHDL Description of Parameterizable RF 
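To make the write behaviour of Listing E.2 easier to follow, the clocked write process can be paraphrased as a small software model. This is a hedged sketch only: the function name `write_cycle` and its list-based arguments are hypothetical, and the priority shown for simultaneous writes to the same register is merely what follows from the loop order as printed; the listing itself leaves write collision detection to be inserted by the designer.

```python
from typing import List


def write_cycle(regs: List[int], enables: List[bool],
                regnrs: List[int], data_in: List[int]) -> List[int]:
    """Model one rising clock edge of the register-file write process.

    regs    : current register contents
    enables : one enable flag per write port
    regnrs  : target register number per write port
    data_in : data word per write port
    Returns the register contents after the clock edge (hypothetical model).
    """
    new_regs = list(regs)  # registers without an enabled write keep their value
    # Walk the ports from the highest index down to 0, as the VHDL loop does.
    # A later assignment overrides an earlier one, so port 0 wins a collision.
    for port in range(len(enables) - 1, -1, -1):
        if enables[port]:
            new_regs[regnrs[port]] = data_in[port]
    return new_regs


if __name__ == "__main__":
    # Both ports target register 2 in the same cycle.
    print(write_cycle([0, 0, 0, 0], [True, True], [2, 2], [11, 22]))
```

Reading is purely combinational in the listing, so a corresponding read model would simply index the register list with each read-port register number.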




ARCHITECTURE rtl OF ppu_bitmanip IS 
  SIGNAL rbit_notwbit : std_logic; 
  SIGNAL enable_imm   : std_logic; 
  SIGNAL immediate    : immediate_value_t; 
  SIGNAL op0_in       : data_path_t; 
  SIGNAL op1_in       : data_path_t; 
  SUBTYPE mask_t IS unsigned(bitmanip_max_mod_field-1 DOWNTO 0); 
  TYPE mask_table_t IS ARRAY (bitmanip_max_mod_field-1 DOWNTO 0) OF mask_t; 
  SIGNAL mask_table   : mask_table_t; 
  SIGNAL dummy        : std_logic; 
BEGIN 
  bitmanip_block_proc : PROCESS (enable, rbit_notwbit_nb, enable_imm_nb, 
                                 immediate_nb, op0_in_nb, op1_in_nb) 
  BEGIN 
    IF (enable = '0' AND use_blocked_input = 1) THEN 
      rbit_notwbit <= '0'; 
      enable_imm   <= '0'; 
      immediate    <= (OTHERS => '0'); 
      op0_in       <= (OTHERS => '0'); 
      op1_in       <= (OTHERS => '0'); 
    ELSE 
      rbit_notwbit <= rbit_notwbit_nb; 
      enable_imm   <= enable_imm_nb; 
      immediate    <= immediate_nb; 
      op0_in       <= op0_in_nb; 
      op1_in       <= op1_in_nb; 
    END IF; 
  END PROCESS; 

  bitmanip_proc : PROCESS (rbit_notwbit, enable_imm, immediate, 
                           op0_in, op1_in, mask_table) 
    VARIABLE value_to_insert_v : std_logic_vector(bitmanip_max_mod_field-1 
                                                  DOWNTO 0); 
    VARIABLE masked_value_v    : data_path_t; 
    VARIABLE shifted_mask_v    : std_logic_vector(bitmanip_max_affected 
                                                  DOWNTO 0); 
    VARIABLE shifted_op_v      : data_path_t; 
    VARIABLE length_v          : unsigned(bitmanip_len_field_len-1 
                                          DOWNTO 0); 
    VARIABLE right_pos_v       : unsigned(bitmanip_rpos_field_len-1 
                                          DOWNTO 0); 
  BEGIN 
    length_v := conv_unsigned(immediate(imm_value_width- 
                  bitmanip_value_field_len-1 DOWNTO 
                  imm_value_width-bitmanip_value_field_len 
                  -bitmanip_len_field_len), 
                  bitmanip_len_field_len); 
    right_pos_v := conv_unsigned(immediate(bitmanip_rpos_field_len-1 
                     DOWNTO 0), bitmanip_rpos_field_len); 
    shifted_mask_v := conv_std_logic_vector(shl(mask_table( 
                        conv_integer(length_v)), 
                        right_pos_v), bitmanip_max_affected+1); 

    IF (rbit_notwbit = '0') THEN 
      IF (enable_imm = '1') THEN -- wbiti 
        value_to_insert_v := conv_std_logic_vector(conv_unsigned(immediate( 
          imm_value_width-1 DOWNTO imm_value_width-bitmanip_value_field_len), 
          bitmanip_value_field_len), bitmanip_max_mod_field); 
      ELSE -- wbit 
        value_to_insert_v := op1_in(bitmanip_max_mod_field-1 DOWNTO 0); 
      END IF; 



Listing E.3: Synthesizable VHDL Architecture of Bit-Manipulation Unit 1/2 



      -- mask significant bits 
      masked_value_v := (OTHERS => '0'); 
      masked_value_v(bitmanip_max_mod_field-1 DOWNTO 0) := 
        value_to_insert_v AND 
        conv_std_logic_vector(mask_table(conv_integer(length_v)), 
                              bitmanip_max_mod_field); 

      -- shift masked_value_v to the left by right_pos_v bits 
      masked_value_v := conv_std_logic_vector(shl(UNSIGNED(masked_value_v), 
                          right_pos_v), data_path_width); 

      -- combine results 
      result <= (op0_in AND (NOT conv_std_logic_vector( 
                   UNSIGNED(shifted_mask_v), data_path_width))) 
                OR masked_value_v; 
    ELSE -- rbit 
      -- shift right 
      shifted_op_v := conv_std_logic_vector(shr(UNSIGNED(op0_in), 
                        right_pos_v), data_path_width); 

      -- mask 
      result <= shifted_op_v AND 
                conv_std_logic_vector(mask_table( 
                  conv_integer(length_v)), data_path_width); 
    END IF; 
  END PROCESS; 

  gen_mask_proc : PROCESS (dummy) 
  BEGIN 
    mask_table <= (OTHERS => (OTHERS => '0')); 
    FOR j IN bitmanip_max_mod_field-2 DOWNTO 0 LOOP 
      FOR i IN j DOWNTO 0 LOOP 
        mask_table(j+1)(i) <= '1'; 
      END LOOP; -- i 
    END LOOP; -- j 
    mask_table(0) <= (OTHERS => '1'); 
  END PROCESS; 

END rtl; 



Listing E.4: Synthesizable VHDL Architecture of Bit-Manipulation Unit 2/2 
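The mask table built by the last process of Listing E.4 is easy to misread, so a small software paraphrase may help. The sketch below is an illustration under assumptions, not part of the ICORE sources: the function name `make_mask_table` is hypothetical, and the parameter `max_mod_field` stands in for the constant bitmanip_max_mod_field. It reproduces the loop structure of the listing: entry j+1 has bits j down to 0 set, and entry 0 is set to all ones afterwards.

```python
from typing import List


def make_mask_table(max_mod_field: int) -> List[int]:
    """Build the mask table as the mask-generation loops of the listing do.

    Entry k (k >= 1) holds k ones, right-aligned. Entry 0 holds
    max_mod_field ones (hypothetical software model of the VHDL process).
    """
    table = [0] * max_mod_field
    for j in range(max_mod_field - 2, -1, -1):   # j = max_mod_field-2 .. 0
        for i in range(j, -1, -1):               # set bits j .. 0
            table[j + 1] |= 1 << i
    table[0] = (1 << max_mod_field) - 1          # index 0: all bits set
    return table


if __name__ == "__main__":
    print([hex(m) for m in make_mask_table(8)])
```

One plausible reading, which the listing does not state explicitly, is that a length field value of 0 selects the full field width, so that an n-bit length field can address field lengths from 1 up to the maximum.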



ENTITY ppu_bitmanip IS 
  PORT ( 
    enable          : IN  std_logic; 
    rbit_notwbit_nb : IN  std_logic;          -- '0' = wbit instruction 
    enable_imm_nb   : IN  std_logic;          -- for wbiti 
    immediate_nb    : IN  immediate_value_t;  -- immediate from instruct. 
    op0_in_nb       : IN  data_path_t;        -- operand from register 
    op1_in_nb       : IN  data_path_t;        -- "value field" for wbit 
    result          : OUT data_path_t         -- update for r0 
  ); 
END ppu_bitmanip; 









Listing E.5: Synthesizable VHDL Entity of Bit-Manipulation Unit 






Appendix F 



Area, Power and Design Time 
for ICORE 



Figure F.1 depicts the area breakdown of the ICORE design module, 
which contains the processor core, instruction and data memory as well 
as an I/O processor. The total area of this design module is about 0.65 
sq. mm. 



[Pie chart not reproduced. Legible slice labels: Shifter 4.0%, 
Bit-manipulation 0.7%, Saturation (value not legible).] 

Figure F.1: ICORE Area Breakdown 



A significant amount of the area is consumed by design entities that 
require a large number of area-intensive D flip-flops, such as the general 
purpose register file and the I/O processor. The functional units of the 
processor consume less than 15% of the total area. 



Figure F.2 shows the power consumption of the ICORE module for all 
design entities as well as the clock tree. The total power of the 
complete design module is 18.8 mW. 

Compared to other processor-based systems, the power consumption in 
the clock tree is significantly lower, due to the extensive use of clock 
gating. This is especially efficient, because the switching activity of 
many register-intensive design entities is low: for instance, a large part 
of the I/O controller is used to store mainly static information 
that configures the operating mode of the design. Moreover, the power 
consumption of the instruction memory in Figure F.2 was measured after 
the optimization process described in Section 6.3.1. The power consumed 
in the instruction ROM is still significant despite this optimization, 
because the memory is accessed in each clock cycle. 



[Pie chart not reproduced. Legible slice labels are listed below; the 
remaining slices are not legible.] 

General Purpose Register File   17.9% 
Routing MUXes                    9.3% 
Decoder                          3.6% 
Multiplier                       3.1% 
Shifter                          2.9% 
Misc.                            2.1% 
ALU                              1.7% 
Global Controller                1.7% 
Bit-manipulation                 1.1% 
Saturation                       0.8% 
Data Address Generator           0.4% 
Addsub                           0.1% 
Status Register                  0.0% 

Figure F.2: ICORE Power Breakdown for a Typical Operation Scenario 
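The percentages of Figure F.2 can be converted into absolute figures using the total of 18.8 mW stated in the text. The short sketch below does only this arithmetic; the helper name `absolute_power_mw` and the selection of entities are choices made for illustration, and the resulting values are derived numbers, not measurements reported in this appendix.

```python
TOTAL_POWER_MW = 18.8  # total power of the design module, as stated in the text

# A few of the percentage shares read from the power breakdown figure.
SHARES_PERCENT = {
    "General Purpose Register File": 17.9,
    "Routing MUXes": 9.3,
    "Decoder": 3.6,
    "Multiplier": 3.1,
}


def absolute_power_mw(share_percent: float,
                      total_mw: float = TOTAL_POWER_MW) -> float:
    """Convert a percentage share of the total power into milliwatts."""
    return total_mw * share_percent / 100.0


if __name__ == "__main__":
    for entity, share in SHARES_PERCENT.items():
        print(f"{entity}: {absolute_power_mw(share):.2f} mW")
```

For the register file, for example, 17.9% of 18.8 mW corresponds to roughly 3.4 mW.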



The total design time of ICORE was 10.5 man-months. Figure F.3 
shows the percentage of the design time for the different design tasks. 
The time for the design space exploration is included in the assembly 
programming and HW description design tasks. The time for all 
verification tasks in this case study is significant, at about 45% of the 
total design time. An optimizing (and functionally correct) HLL compiler 
would have considerably decreased the time for assembly programming 
and SW verification. 




Figure F.3: Distribution of ICORE Design Time 



Appendix G 



Acronyms 



A        silicon area 
AD/DA    analog digital / digital analog converter 
ALICE    here: name of a parameterizable processor architecture with 
         compiler support 
ALU      arithmetic logic unit 
API      application programming interface 
ASI      application specific instruction 
ASIC     application-specific integrated circuit 
ASIP     application-specific instruction set processor 
ASPP     application-specific programmable processor 
ATM      asynchronous transfer mode 
BL       bit line in a memory array 
CDFG     control data flow graph 
CLIW     configurable long instruction word 
CMOS     complementary metal oxide semiconductor 
COFDM    coded orthogonal frequency division multiplex 
COFF     common object file format 
CORDIC   coordinate rotation digital computer 
COSY     compiler design system (ACE) 
CPU      central processing unit 
DES      data encryption standard 
DFG      data flow graph 
DFT      discrete Fourier transformation 
DMA      direct memory access 
DOA      direction-of-arrival (multi-antenna systems) 
DSP      digital signal processing or digital signal processor 
DSSP     domain specific signal processors 
DVB-T    terrestrial digital video broadcasting 
E        energy 



EDA      electronic design automation 
EPIC     explicitly parallel instruction computing 
EV       eigenvector or eigenvalue 
EVD      eigenvalue/eigenvector decomposition 
FFT      fast Fourier transformation 
FIR      finite impulse response (filter) 
FPGA     field-programmable gate array 
FSM      finite state machine 
FU       functional unit of a processor 
GCC      GNU C compiler 
GP       general purpose 
GPR      general purpose register 
GSM      global system for mobile communications 
GUI      graphical user interface 
HDL      hardware description language 
HLL      high level language 
HW       hardware 
ICORE    ISS-core (the first ASIP designed at the Institute for 
         Integrated Signal Processing Systems) 
IDFT     inverse discrete Fourier transformation 
IIR      infinite impulse response (filter) 
ILP      instruction level parallelism 
IP       intellectual property 
IPO      in-place optimization 
IRQ      interrupt request 
ISA      instruction set architecture 
ISI      inter-symbol interference 
ISS      instruction set simulator 
JPEG     joint photographic experts group 
LISA     language for instruction set architecture description (ISS) 
LMS      least mean square 
LSB      least significant bit/byte 
MIMD     multiple instruction, multiple data (architecture) 
MIPS     million instructions per second, commercial processor 
         architecture 
MISD     multiple instruction, single data (architecture) 
MOS      metal oxide semiconductor 
MPEG     motion pictures experts group 
MSB      most significant bit/byte 
ND2      NAND standard cell with two inputs 
NMOS     metal oxide semiconductor with N-field effect transistors 
NOP      no-operation instruction 
NP       network processor 
OFDM     orthogonal frequency division multiplexing 
PAST     projection approximation subspace tracking 
PC       program counter 
PEQ      phase equalization 
PSA      programmable system architecture 
PTL      processor template library 
QAM      quadrature amplitude modulation 
RAM      random access memory 
RF       register file 
RISC     reduced instruction set computer 
ROM      read only memory 
RTL      register transfer language 
SIMD     single instruction, multiple data (architecture) 
SISD     single instruction, single data (architecture) 
SM       signed magnitude (number representation) 
SPARC    scalable processor architecture 
SPICE    simulation program with integrated circuit emphasis 
SRAM     static random access memory 
SVD      singular value decomposition 
T        critical path of a synchronous design 
TCG      test case generator 
VHDL     VHSIC hardware description language 
VHSIC    very-high-speed integrated circuits 
VLIW     very long instruction word 
VLSI     very large scale integration 
XOR      exclusive logical OR operation 